Abstract

The purpose of this research was to determine the number of distinct words
in printed school English. A detailed analysis was done of a 7,260-word
sample from the Carroll, Davies and Richman Word Frequency Book.
Projecting from the sample to the total vocabulary of school English, our
best estimate is that it contains about 88,500 distinct words.
Furthermore, for every word a child learns, we estimate that there are an
average of one to three additional related words that should also be
understandable to the child, the exact number depending on how well the
child is able to utilize context and morphology to induce meanings. Based
on our analysis, a reconciliation of estimates of children's vocabulary
size was undertaken. It showed that much of the extreme divergence in
estimates is due to the definition of "word" adopted. Our findings
indicate that even the most ruthlessly systematic direct vocabulary
instruction could neither account for a significant proportion of all the
words children actually learn, nor cover more than a modest proportion of
the words they will encounter in school reading materials.

The Number of Words in Printed School English

Determining the absolute size of individuals' vocabularies is of more
than purely theoretical interest. If a student must learn 8,000 words by
his or her senior year in high school, this goal might be reached via an
ambitious program of direct instruction. If, on the other hand, the number
of words to be learned were closer to 80,000, this goal would be beyond the
reach of even the most intensive direct instruction that could be
accomplished in the time available. The absolute size of vocabularies also
has implications for theories of learning and language acquisition. If
some seventh graders have vocabularies of over 50,000 words, as is
estimated by some researchers, a theory of language acquisition must
include mechanisms that could account for this phenomenal accomplishment.

There is in fact a substantial lack of agreement among researchers as
to the absolute size of vocabulary at any given age or level of development
(see Anderson & Freebody, 1981). For example, estimates of average total
vocabulary size at third grade range from 2,000 words (Dupuy, 1974) to
25,000 words (M. K. Smith, 1941). The same two researchers estimate the
vocabularies of seventh graders to be around 4,760 and 51,000 words,
respectively. Some of the reasons for such large disparities between
estimates are the source of words (e.g., what dictionary or corpus to take
as representing English vocabulary, and how to choose a representative
sample), testing methods (disagreements about when a word can be counted as
"known," and how to test such knowledge), and the definition of "word"
adopted (disagreements about, for example, whether to include proper names,
or under what conditions to count derived words as separate items).



It is with the third of these issues that we will primarily be
concerned here. Our goal is to answer the question "How many different
words are there?" in a number of ways, for a variety of criteria for
defining "distinct words." This will allow us to reconcile estimates of
vocabulary size based on different criteria for counting words. Our
technique will be to recalibrate previous estimates using benchmarks
derived from a corpus that we have analyzed in depth.



A Corpus of Words Representative of Printed School English

Dictionaries are often used as a starting point for building tests to
estimate vocabulary size, although, as Carroll (1964) pointed out, this is
a questionable practice. The organization and inclusion or exclusion of
items in a dictionary will reflect not only linguistic principles, but also
diverse practical demands such as page format and limitations on overall
size. And the estimates of vocabulary size that a given test produces are
related to the size of the dictionary that was used in constructing the
test (Lorge & Chall, 1963; M. K. Smith, 1941). It should be apparent that a
dictionary is an unstable base from which to estimate vocabulary size.

Further variation is introduced in the selection of items from the
dictionary. Researchers differ in whether categories such as proper names,
technical terms, or scientific names of flora and fauna are included, and
in the criteria for determining which derived words are to be counted as
separate items.

Constructing or evaluating a test which attempts to measure absolute
vocabulary size, therefore, depends on the answer to three questions: what
source of words should be used, what types of words should be included or
excluded, and under what conditions related words should be grouped
together or treated as separate items. In this paper we will attempt to
give principled answers to these questions. The goal is estimates of
vocabulary size that are interpretable in terms of their implications for
vocabulary instruction.

We have chosen as our source of words Carroll, Davies, and Richman's
(1971) American Heritage Word Frequency Book (henceforth, the WFB). This
book is based on the American Heritage Intermediate Corpus, which contains
5,088,721 words of running text from over a thousand items of published
materials in use in schools. These were selected on the basis of a careful
survey "to represent, as nearly as possible, the range of required and
recommended reading to which students are exposed in school grades three
through nine in the United States" (p. xxi). The materials sampled
included textbooks, workbooks, kits, novels, poetry, general nonfiction,
encyclopedias, and magazines. The WFB summarizes the largest and most
recent corpus of the written language children encounter in school.
Furthermore, Carroll, Davies, and Richman have been able to use the corpus
to determine properties not just of the vocabulary contained in the WFB,
but of the total vocabulary of the type of materials from which the sample
was collected. This total vocabulary is a theoretical construct, but its
overall size (and several other properties) can be predicted with a
substantial degree of confidence. Thus, our analysis can be generalized
not just to the vocabulary in the WFB, but to the entire population of
which the WFB constitutes a representative sample. Because of the way that
the American Heritage Intermediate Corpus was collected, we can justifiably
refer to this population as "printed school English" (with the restriction
to grades three through nine understood).

"Printed school English," in this sense, gives us the basis for an
operational definition of the total vocabulary of English, keeping in mind
that we are restricting ourselves to written language intended largely for
children. A vocabulary test based on this material could not be taken as a
measure of a child's oral vocabulary, but would certainly be appropriate as
a measure of a child's reading vocabulary.

One might be concerned at this point that written language intended
for children is too restricted in vocabulary. This concern seems
reasonable, but as it turns out it is not warranted. As we will see, even
an unabridged dictionary gives a more limited picture of English vocabulary
than do the projections of Carroll and his associates from their sample to
the total vocabulary of written materials used in schools.

On Defining the Concept "Word"

Absolute vocabulary size can only be discussed in terms of some theory
of relatedness among words. For example, the WFB is described as
containing 86,741 different words, or types. However, since the corpus was
sorted by computer, "word" is defined as a graphically distinct sequence of
characters bounded right and left by a space. By this definition, doctor,
Doctor, and DOCTOR, being graphically distinct, are counted as three
different words. Obviously, a psychologically more realistic definition of
"word" will count these three types as instances of the "same word."
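
The difference between the two definitions can be made concrete with a
minimal Python sketch. It is an illustration only, not the procedure used to
compile the WFB; the function names and the sample string are assumptions
introduced here. The first function counts every space-bounded string as its
own type, and the second merges types that differ only in capitalization.

```python
from collections import Counter

def graphic_types(text: str) -> Counter:
    """Count types as graphically distinct, whitespace-bounded strings."""
    return Counter(text.split())

def case_folded_types(text: str) -> Counter:
    """Count types after merging forms that differ only in capitalization."""
    return Counter(token.lower() for token in text.split())

# Hypothetical sample built from the forms mentioned in the text.
sample = "doctor Doctor DOCTOR walk walks walked"
print(len(graphic_types(sample)))      # 6 graphically distinct types
print(len(case_folded_types(sample)))  # 4 once capitalization is ignored
```

Case folding handles only capitalization; grouping inflected forms such as
walk, walks, and walked requires the relatedness coding described later.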

Dictionaries have traditionally treated regular inflectional variants,
for example, walk, walks, and walked, as being forms of the same word.
This is pedagogically justifiable; by the time children reach first grade,
they have normally learned the basics of English inflection. If a child
has learned the word antelope, no separate instruction about the plural
antelopes is needed; children can automatically apply the rules of regular
pluralization to new words.



Some dictionaries take other types of relatedness into account when
grouping words into entries. Many list semantically transparent
derivatives as subentries. For example, the American Heritage School
Dictionary gives meekness and meekly as subentries under meek without
further definition. Along similar lines, Thorndike (1921) grouped adverbs
ending in -ly under their base forms, thus counting sadly and sad as one
word. From a theoretical perspective, Aronoff (1976) argued that words
derived by totally productive word formation processes (e.g., -ness, -ly)
should not be given separate entries in the lexicon.

However, there is a great variety of types and degrees of relatedness
among words that might be taken into consideration when estimating
vocabulary size, ranging from the transparent cases just mentioned to more
obscure relationships such as that between quiet and acquiesce. And there
has been little agreement among vocabulary researchers as to how different
types of relatedness among words should be treated. The extremes run from
counting inflectional variants as separate words on the one hand, to a
radical grouping such as in Dupuy (1974), who excluded from his count of
"Basic Words" almost all suffixed, prefixed, and compound items, since
these could in some sense be considered to be derived from more basic
words and hence at least partially redundant. It should be clear that
decisions concerning how words should be counted will be a major factor in
determining the magnitude of estimates of vocabulary size.

Previous analyses of relatedness among words have not provided an
adequate basis for meaningful measures of absolute vocabulary size; they
each suffer from at least one of a number of weaknesses. Many take an
etymological or historical, rather than synchronic, approach to
relationships among words, positing relationships based on information not
available to the normal language learner. Some statistical analyses of
word formation have been limited to prefixes, or to suffixes, or perhaps
both of these, while neglecting compounding. Previous studies have usually
adopted a single criterion of relatedness among words, without
distinguishing types or degrees of relatedness. Some studies are based on
wordlists such as Thorndike and Lorge (1944) which are now outdated.

Becker, Dixon and Anderson-Inman (1980) have perhaps come closest to
our purposes in their analysis of a vocabulary list derived by modifying
and updating Thorndike and Lorge (1944). They have analysed a list of
25,782 words into morphographs (minimal "meaningful" units of written
English), and assigned each word a root word which represents "the smallest
word from which a given word can be semantically derived." This root word
analysis does define patterns of interrelatedness among words to a certain
extent. For example, divide, divided, dividend, dividers, dividing,
divisible, division, divisional, and divisor are related in that all have
been assigned the same root word divide.

However, in their analysis, there are no distinctions made between
possible types or degrees of relatedness. Also, relatedness is defined on
an etymological rather than synchronic basis. For example, millennium was
assigned the root word annual. It is certainly possible for a historical
linguist to see the relationship in form between these two words, but it is
dubious that the normal speaker of English, armed only with such knowledge
of morphology as can be gained from words currently in the language, would
find any but a semantic relationship. Animism and animosity were assigned
the root word anima; in this case, the relationship in form may be obvious,
but the semantic relationship is rather distant. In the case of polynomial
and its root word name, both the formal and semantic relationships are
tenuous.

Analyses of affixes, for example, Thorndike (1941) or Stauffer (1942),
have also typically been done on an etymological basis, e.g., segmenting
fragile into a root frag- and the suffix -ile, or deceive into the prefix
de- and the root -ceive. An exception to this is found in Harwood and
Wright (1956), who specify in their counts which suffixed forms have a free
base (e.g., acceptable) and which do not (e.g., amiable). However, while
these analyses do give an indication of the extent to which some suffixes
account for a portion of the overall vocabulary, they do not provide a basis
for estimating the overall size of vocabulary; that is, they do not tell us
what percentage of words actually are derivable using a given suffix.

Rhode and Cronnell (1977) have analysed a set of vocabulary items
especially compiled to cover words used in school grades. However, their
analysis, while including much useful information, focuses on types of
letter-sound correspondences, so that their definitions of "prefix" and
"suffix" are not in terms of productive word-formation processes in today's
English. For example, their list of suffixes includes the -om of bottom and
the -il of peril.

In our analyses, we will approach the question of relatedness among
words not solely in terms of similarity of form, or in terms of
etymological relationships, but rather in terms of the relative ease or
difficulty with which a child could either learn the meaning of a word,
or infer its meaning in context while reading. Also, we will define
different types and degrees of relatedness among words, so that we can
adjust our definition of "related" and "distinct" to match the knowledge
of word-relatedness of children of a given age or ability level.

Method

The data and statistical analyses in the WFB provide a reliable
starting point for investigating the vocabulary of printed school English.
However, the definition of "word" adopted for the purpose of compiling the
WFB is, as the authors would freely admit, inappropriate for any linguistic
or pedagogical estimate of vocabulary size. Our goal, then, is to
categorize the different types of words in the WFB, and how they are
related to each other, in order to arrive at a meaningful estimate of the
number of different words in printed school English.

A random sample of 7,260 words was drawn from the 86,741 words in the
WFB. This sample consists of 121 chunks of 60 contiguous words. The
chunks were approximately evenly distributed throughout the alphabetical
list. Contiguous groups of words were taken because related words are
usually (but not always) close to each other in an alphabetical listing.
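
A sampling scheme of this general shape can be sketched in Python as
follows. This is an illustration under stated assumptions, not a
reconstruction of the procedure actually used: the text says only that the
chunks were approximately evenly distributed, so the stratified placement
below (one chunk at a random offset within each of 121 equal strata), the
function name, and the dummy word list are all assumptions introduced here.

```python
import random

def sample_chunks(words, n_chunks=121, chunk_size=60, seed=0):
    """Draw n_chunks runs of chunk_size contiguous words, one per stratum.

    The sorted list is cut into n_chunks equal strata and one run is drawn
    at a random offset inside each stratum, so the chunks are spread
    roughly evenly through the alphabet and cannot overlap.
    """
    ordered = sorted(words)
    stratum = len(ordered) // n_chunks
    if stratum < chunk_size:
        raise ValueError("word list too short for the requested sample")
    rng = random.Random(seed)
    sample = []
    for i in range(n_chunks):
        start = i * stratum + rng.randint(0, stratum - chunk_size)
        sample.extend(ordered[start:start + chunk_size])
    return sample

# Stand-in list of 86,741 dummy "types"; the real input would be the WFB list.
dummy = [f"w{i:05d}" for i in range(86741)]
print(len(sample_chunks(dummy)))  # 121 * 60 = 7260
```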

Table 1 gives an example of a group of related words, or "word
family," that is found in one of the chunks in our sample. The pattern of
interrelationships among these items is somewhat complex. It might be
represented graphically as in Figure 1. This figure shows that there are
multiple branching structures, and that two words may be related via one or
more intervening words. This figure does not distinguish between different
types or degrees of relatedness among words. A more complete
representation would specify, for example, that the relationship between
add and ADD is one of capitalization, while the relationship between
addition and additional is suffixation.



Insert Table 1 and Figure 1 about here.



The set of possible relationships can be represented in terms of pairs
of words, each pair representing two words which are adjacent and connected
by a line in Figure 1. This type of representation, as depicted in Table
2, was used in our analyses. For each word in our sample, its "immediate
ancestor" was found, that is, the word to which it is most closely related
and which is in some sense more basic than the target word.



Insert Table 2 about here.



In the majority of cases, the identity of the immediate ancestor is
not problematic. For an inflected form, e.g., adds, the immediate ancestor
is the uninflected stem or infinitive, add. For the past tense, it would
be the present (infinitive) form as well. For plurals, the immediate
ancestor is the singular. For forms with a prefix, e.g., unknown, the
immediate ancestor is the unprefixed form, known. For forms with a suffix,
e.g., additional, the immediate ancestor is the form without the suffix,
addition. For compounds, e.g., addition-subtraction, there are two
immediate ancestors, one for each part, in this case, addition and
subtraction.

More problematic cases were treated as follows: If a word has both a
prefix and suffix, as does undecided, one chooses as the immediate ancestor
the form that is semantically closest. In this case, there is no word
*undecide, so that only one analysis is possible: undecided has as its
immediate ancestor decided, which in turn has as its immediate ancestor
decide. In a case such as reactivation there are two reasonable analyses.
On the other hand, both analyses arrive at activate as an ancestor, and the
choice will not make any difference in terms of the ultimate count of
prefixes and suffixes.

In some relationships, for example, that between multiple and the verb
multiply, it is difficult to say which item is more "basic" than the other.
We recognize all the dangers and complications of saying that one word is
"derived from" another. For the purposes of analysing patterns of
interrelatedness among the words in the corpus, it is necessary to break
down the relationships into asymmetrical dyads; however, we assign no
theoretical weight to the directionality of the relationship.
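
The dyad representation lends itself to a simple lookup table. The Python
sketch below is an illustration only: the table holds just the pairs named in
the text, the names IMMEDIATE_ANCESTOR and ancestor_chain are introduced
here, and each word is given a single ancestor, so compounds (which have two
immediate ancestors) are outside its scope.

```python
# Hypothetical excerpt of a word -> immediate-ancestor table, using only
# pairs named in the text. Base words have no entry.
IMMEDIATE_ANCESTOR = {
    "adds": "add",
    "unknown": "known",
    "additional": "addition",
    "undecided": "decided",
    "decided": "decide",
}

def ancestor_chain(word, table=IMMEDIATE_ANCESTOR):
    """Return the word followed by each successive immediate ancestor."""
    chain = [word]
    while chain[-1] in table:
        parent = table[chain[-1]]
        if parent in chain:  # guard against accidental cycles in the table
            break
        chain.append(parent)
    return chain

print(ancestor_chain("undecided"))  # ['undecided', 'decided', 'decide']
```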

In some cases the immediate ancestor of a given item was not found in
the corpus. For example, abatement and abates are both found, but not
abate. In this case, the item abate was added to the list, and flagged as
a "missing ancestor." Sometimes intermediate forms were missing. In the
group of words in Tables 1 and 2, for example, if the word addend had not
occurred in the corpus, the relationship between addends and add would have
involved two steps, suffixation and pluralization. In our analyses we
supplied such "missing links" wherever necessary, flagging them to mark
that they were not in the original list of words from the WFB.
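
The bookkeeping for missing ancestors and missing links might look like the
following Python sketch. It is a hedged illustration, not the authors' coding
procedure: the function name, the dictionary layout, and the use of a boolean
flag are assumptions introduced here, and the ancestor table is taken as
given rather than derived.

```python
def add_missing_ancestors(sample_words, ancestor_of):
    """Return {word: in_original_sample}, adding any ancestor not sampled."""
    items = {word: True for word in sample_words}
    for word in sample_words:
        seen = {word}
        current = word
        # Walk up the chain, stopping at a base word or a repeated item.
        while current in ancestor_of and ancestor_of[current] not in seen:
            current = ancestor_of[current]
            seen.add(current)
            items.setdefault(current, False)  # False marks a supplied item
    return items

# The example from the text: abate itself does not occur in the corpus.
ancestors = {"abatement": "abate", "abates": "abate"}
coded = add_missing_ancestors(["abatement", "abates"], ancestors)
print(coded)  # {'abatement': True, 'abates': True, 'abate': False}
```

Because the walk continues up the whole chain, an absent intermediate form
such as addend between addends and add would be supplied in the same way.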

For each pair of items, the relationship between them was categorized.
The basic categories used in our analyses are listed and exemplified in
Table 3. A more detailed description of these categories and their special
subcategories is found in Appendix A.



Insert Table 3 about here.

Coding Semantic Relatedness

In addition to distinguishing among different types of formal
relationships between a word and its immediate ancestor (e.g., suffixation,
prefixation, compounding), our coding system categorizes the semantic
relationship between the two. For some pairs, e.g., tranquil/tranquility,
the semantic relationship is fairly direct. For other pairs of words, it
is more distant, e.g., fun/funny, live/lively, or descend/condescend.

An immediate problem in trying to characterize the semantic
relationship between two words is the fact that one or both of them may
have a number of meanings. Before one can describe the semantic
relationship between the two, one must first decide which two meanings are
to be compared.

We have tackled this problem in our coding system by representing the
semantic relationship between two words in terms of two dimensions. The
first represents the semantic relationship between the two most similar
meanings of the two words. The second represents the relationship between
the two most similar familiar meanings of the two words.

What constitutes a "familiar" meaning was necessarily defined in a
rather impressionistic fashion. Basically, a "familiar" meaning was
defined as one which would be likely to occur to an individual when seeing
the word out of context. Given that people are relatively accurate at
intuitively assessing the relative frequencies of different words (cf.
Carroll, 1971, and Carroll et al., 1971), it was hoped that an intuitive
judgement as to the relative frequencies of word meanings would be adequate
for the distinctions which were necessary to make here.

The words carry and carriage illustrate well the distinction we have
made between the relationship of the two most similar meanings and the
relationship of the two most similar familiar meanings. The two most
similar meanings of these words might be the following:

    carry: to hold or move (the body or part of the body) in a certain
        way
    carriage: the manner in which the body is held; posture

These definitions are from the American Heritage School Dictionary, which
is based on the American Heritage Intermediate Corpus, the corpus also
forming the basis for the Word Frequency Book.

The most familiar meanings of these two words, on the other hand, are
probably the following:

    carry: to bear in one's hands or arms, on one's shoulders or back,
        etc., while moving; to transport or convey
    carriage: a four-wheeled passenger vehicle, usually drawn by horses

These two meanings are also related, but not as directly as the first two
cited. Our semantic code for the relationship between carriage and carry
(or between any word and its immediate ancestor) would consist of two
digits, the first representing the degree of semantic relatedness between
the two most similar meanings, the second representing the degree of
relatedness between the two most similar familiar meanings.

Another two-digit code was used to encode the relative familiarity of
the meanings represented by the two digits in the semantic code.

There are two further qualifications about the use of the two-digit
semantic code. If the two most similar meanings of two words were also
familiar meanings, then the second digit was either used to encode the
relationship between other familiar meanings of the two words, or else was
set equal to the first digit.

For example, the word miserable has as its immediate ancestor misery.
It also has two meanings, as in "he made her life miserable" and "miserable
weather." Both of these meanings would be considered familiar meanings, the
first being perhaps slightly more frequent or salient, and definitely being
somewhat more closely related to the meaning of misery. The first digit of
the semantic code was used to encode the meaning of miserable in "he made
her life miserable." The second digit was used for the meaning of miserable
in "miserable weather."

The analyses reported here, unless specified otherwise, will be based
on only the second of the two digits in the semantic code. We feel that
the child's experience in learning the meaning of carriage, or figuring out
its meaning in context, would be most accurately represented by dealing
with the most familiar meanings of the word. It would underestimate the
amount of semantic opacity involved in word-formation processes to always
measure only the semantic distance between the two most similar meanings of
two related words.
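
One way to picture the two-digit semantic code is as a small record with two
fields. The Python sketch below is an illustration only: the class and field
names are introduced here, and the digits assigned to carriage/carry are
invented for the example, since the text does not report them. The digits
are taken to range over the SEM levels 0 through 5 defined in the next
section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticCode:
    most_similar: int   # first digit: the two most similar meanings
    most_familiar: int  # second digit: the two most similar familiar meanings

    def analysis_level(self) -> int:
        """Return the digit used in the reported analyses (the second)."""
        return self.most_familiar

# Digits below are invented for illustration, not taken from the paper.
carriage_carry = SemanticCode(most_similar=1, most_familiar=3)
print(carriage_carry.analysis_level())  # 3
```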



Degrees of Semantic Relatedness

The American Heritage School Dictionary was used as the primary
reference for determining the meanings of words, since this dictionary is
based on the corpus we have analysed, and thus reflects meanings that
actually occurred in the corpus. Other dictionaries were also used,
primarily to determine the nature and existence of less familiar meanings.

The code for semantic relatedness was defined in terms of the
following question: Assuming that the child knew the meaning of the
immediate ancestor, but not the meaning of the target word, to what extent
would the child be able to determine the meaning of the target word when
encountering it in context while reading? The following levels of coding
were used:



SEM 0. This indicates that the relationship between target word and
immediate ancestor is semantically transparent. There are no semantic
features in the target word that are not found in the immediate
ancestor, with the possible exception of any semantic features that would
be totally predictable from a change in part of speech. For example, if a
child knows the word red and has any grasp of the suffix -ness, that child
should be able to compute the meaning of the word redness even without any
help at all from the context. This is the level of semantic transparency
associated with almost all regular inflections. It is also found in many
compounds; if one knows the meaning of plankton and burgers, the meaning of
the rather novel word planktonburgers is easy to compute, without any help
from the context. Many affixes are similarly transparent; knowledge of the
word misinterpret should almost guarantee that a person would understand
the word misinterpretation.

SEM 1. This code means that the meaning of the target item could be
inferred from the meaning of its immediate ancestor with some, but minimal,
help from context; almost any context should do. Any semantic components
in the target word beyond those in the immediate ancestor, or different
from them, would be trivial and predictable even without help from context.
For example, the word entertainer may have some connotations of
professional or official status beyond the simple meaning "one who
entertains," but these are usually associated with the suffix -er, and
therefore could be inferred by a reader even without much contextual
information.

SEM 2. This code means that the meaning of the target item could be
inferred from the meaning of its immediate ancestor with reasonable help
from the context; "one exposure learning" would be possible. The target
word may contain nontrivial semantic features different from or in addition
to the semantic features in the immediate ancestor, but these would require
only a general sort of contextual information to be inferred. For example,
the word gunner means not just anyone who uses a gun, but normally is used
for military personnel with the specific assignment of using or operating
guns. Presumably the semantic components specifying "military personnel"
would be inferrable from the general context in which the word was used;
the context would most likely, for example, rule out an interpretation of
gunner as meaning "gunfighter."

SEM 3. This code means that the meaning of the target item included
semantic features that were not inferrable from the meaning of the
immediate ancestor without substantial help from the context. For example,
the meanings of the words copper and head definitely contribute to the
meaning of the word copperhead. One could infer that it might mean
something like "something with a head made out of copper, or resembling
copper, or of the color of copper." Even with a context like "While walking
through the woods I almost stepped on a copperhead," however, one could not
be sure whether the object in question was a snake, an insect or spider, or
perhaps some rare antique copper coin. Even a phrase such as "bitten by a
copperhead" wouldn't distinguish between snakes and spiders.

SEM 4. This code means that the meaning of the target word is related
to the meaning of its immediate ancestor, but only distantly. The
relationship would probably not be apparent without being pointed out, and
one would definitely not be likely to guess the exact meaning of the target
word if one knew only the meaning of the immediate ancestor. Examples of
pairs of words with this degree of semantic relatedness are: vicious/vice,
farewell/well, motley/mottle, inertia/inert, or saucer/sauce.

SEM 5. This code is used for a lack of any discernible semantic
connection, that is, cases in which the meaning of the immediate ancestor
would be of no use in learning or remembering the meaning of the target
word. Examples of such relationships are clerical/cleric, groovy/groove,
dashboard/dash. (Remember that we are considering only relatively familiar
meanings of each of these words.)

Appendix B contains some additional examples of words and their
immediate ancestors illustrating each level of semantic relatedness.

In the original coding system, a further distinction was made for
levels SEM 1, SEM 2, and SEM 3 between changes in meaning that were
metaphorical versus nonmetaphorical changes or extensions in meaning. This
distinction was collapsed in the analyses reported here.

Another part of the coding system was used to capture what might be
called "semantic specialization," that is, cases in which the immediate
ancestor might have a range of meanings, and the target word would relate
only to one, or a subset, of these. (There are also cases in which the
target word might have a range of meanings beyond those found in the
immediate ancestor.) Because the semantic relationship between any two
words can be very complex, the analyses reported here were limited to the
consideration of the relationship between the two most similar familiar
meanings, as already mentioned.

Roughly speaking, SEM 0, SEM 1 and SEM 2 can be thought of as
semantically transparent relationships; SEM 3 relationships involve
significant unpredictable semantic information; SEM 4 is semantically
obscure, and SEM 5 semantically opaque.
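
The rough grouping of the six levels can be written as a small lookup, shown
here as a Python sketch. The function name and band labels are introduced
for illustration; the example pairs and their levels come from the
definitions above, with the ancestor of each target (e.g., gun for gunner,
copper as one of the two ancestors of copperhead) inferred from the
discussion rather than stated as a coded pair.

```python
def transparency_band(sem_level: int) -> str:
    """Map a SEM level (0-5) to the rough bands named in the text."""
    if sem_level not in range(6):
        raise ValueError("SEM level must be between 0 and 5")
    if sem_level <= 2:
        return "transparent"
    if sem_level == 3:
        return "significant unpredictable information"
    return "obscure" if sem_level == 4 else "opaque"

# Example pairs (target/ancestor) with the levels discussed in the text.
examples = {
    "redness/red": 0,
    "entertainer/entertain": 1,
    "gunner/gun": 2,
    "copperhead/copper": 3,
    "saucer/sauce": 4,
    "dashboard/dash": 5,
}
for pair, level in examples.items():
    print(pair, "->", transparency_band(level))
```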

Types of Words 


Estimates of the total number of words in English differ not only in
how words are counted (e.g., whether derived forms are counted as separate
from their bases or not) but also in terms of whether certain classes of
words are counted at all. The WFB contains various special categories of
words that are often excluded from counts of words: proper names, numbers,
formulae, compounds containing numbers, abbreviations, and nonwords
(strings of characters that clearly do not represent vocabulary items).
Each item in our sample was marked as to whether it belonged in any of
these categories. Details of the criteria used in coding are given in
Appendix C.

Unlike some vocabulary researchers, we did not mark words as rare,
archaic, obsolete, technical, or scientific names of flora or fauna. If a
word actually occurred in the WFB, children do encounter it in their school
reading; we consider this a justifiable operational criterion for defining
the boundaries of printed school English. Rather than trying to come up
with criteria for specialized or technical vocabulary, we feel that such
distinctions, if they become necessary, could be best defined operationally
in terms of the actual distribution of words in the corpus.

Results 

The result of our coding process was a list of 8,669 items, 7,260
being from the original sample, and the rest added to account for missing
ancestors, disambiguations, and second or other members of compounds.
Each item on the list has an immediate ancestor, if one exists, and a code
representing what type of word it is and the morphological and semantic
characteristics of its relationship to its immediate ancestor.

From this list, we can count the number of items falling into each of
the word-type and relationship categories in our coding system. Table 4
gives a summary of the results.

Insert Table 4 about here.

For each category, this table gives five different figures. Sample N is
the number of items in our sample falling into this category; Sample % is
the percent of our sample which this category constitutes, i.e., 100 x
Sample N/7,260. The Corpus N is the estimated number of items in this
category that would be found in the whole WFB. The Population N is the
number of words in the total vocabulary of printed school English (grades 3
through 9) that would fall into this category. Population % is the
percentage of words in this category in the population, i.e., 100 x
Population N/609,606.
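
As a check on the two percentage definitions, the following minimal Python
sketch recomputes Sample % and Population % for one category, using the
Sample N and Population N that this paper reports for "Webster main entry
equivalents." The function and variable names are illustrative assumptions
and are not part of the original coding system.

```python
# Totals stated in the text: 7,260 sampled items and an estimated
# 609,606 words in the population of printed school English.
SAMPLE_TOTAL = 7_260
POPULATION_TOTAL = 609_606


def percent_of_total(count: int, total: int) -> float:
    """Return 100 x count / total, rounded to two decimal places."""
    return round(100 * count / total, 2)


# Figures reported in this paper for "Webster main entry equivalents".
sample_n = 3_156
population_n = 243_136

sample_pct = percent_of_total(sample_n, SAMPLE_TOTAL)
population_pct = percent_of_total(population_n, POPULATION_TOTAL)

print(f"Sample %:     {sample_pct}")
print(f"Population %: {population_pct}")
```

Both values agree with the percentages reported for that category (43.47
and 39.88).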

Since our sample is essentially a random sample of the WFB, we can
assume that the percentage of items in a category in our sample will be
approximately the percentage of items in that category for the entire WFB.
However, there is an important sense in which the WFB (and hence our sample
of it) is not representative of the population of words from which it is
drawn. As the analyses by Carroll, Davies, and Richman (1971) indicate
(see Table B-8), all of the roughly 14,000 words in printed school English
with frequencies greater than 2.5 per million would be expected to occur at
least several times in the WFB. On the other hand, of the more than 200,000
words with a frequency of less than two per billion, fewer than 100 would
be expected to show up in a corpus this size. Thus, in extrapolating from
any corpus to the total vocabulary, a very high frequency word represents
only itself, so to speak, whereas a low frequency word must be taken as
representative of a large number of low frequency words which did not
actually appear in the corpus.

Our estimates of the composition of the population have taken this
into account by assigning a weight to each word, which is an inverse
function of its frequency. This is why the Population % is often
substantially different from the Sample %. For example, 11.65% of the
words in our sample are morphologically basic. However, it turns out that
morphologically basic words are not evenly distributed by frequency. Among
the most frequent words in our sample (those that would occur on the
average twice or more in a million running words of text) almost 28% were
morphologically basic. However, among the less frequent words this
percentage decreased, averaging around 6% in the lower frequency ranges.
The percentage of morphologically basic words in the population (7.46%)
reflects the fact that the population of words in printed school English
has a higher proportion of low frequency words than does the WFB or our
sample.

Table 4 is organized as follows: First of all, the different coding
categories are arranged approximately according to how they relate to
possible definitions of "word." The first group of coding categories are
those which would be counted as constituting "separate words" in many
definitions of "word," and which would appear as separate entries in most
dictionaries. The second group of coding categories are those that might
not be considered separate words for some purposes, but would often have
separate entries in dictionaries. For example, mice might not always be
considered to be a separate word from mouse, for the purpose of counting
words, but it would occur as a separate entry in most dictionaries.

The third group of categories contains those such as regular
inflections that would not normally occur as separate items in
dictionaries.

The fourth group contains categories of proper names, which are
excluded from some, but not all, dictionaries and estimates of vocabulary
size. Proper names were further subdivided as follows: Basic proper names
are those proper names which were also categorized as morphologically
basic. Derived proper names are words derived from proper names by some
word-formation process, i.e., by suffixation, prefixation, compounding, or
some morphologically idiosyncratic relationship. Inflectional and other
variants of proper names include plurals and other variants of proper names
that would not be given separate entries in a dictionary. Capitalizations
homographic with proper names are those forms, such as Cliff, which might
be either a proper name or the capitalization of a non-proper name. Since
the noncapitalized form cliff has already been counted elsewhere, we have
counted these as constituting proper names. In answer to the question "How
many distinct proper names are there?" one would probably want to include
all of these categories except for "inflectional and other variants of
proper names."

The remaining categories in Table 4 are those which would not normally
be counted as separate words or be listed as words in a dictionary.
Note that the categories of special types of words (proper names,
formulae and numbers, compounds containing numbers, nonwords and foreign
words) are not included in the relationship categories in the first three
groups. Thus, the category "morphologically basic words" actually includes
only morphologically basic words which are not proper names, foreign words,
numbers, etc.

Even without further analysis, certain things are already clear about
the estimated vocabulary of printed school English. Most importantly, it
is very large. By many definitions of "word," the population includes over
200,000 words, and another 100,000 proper names. A large number of
words (over 170,000) are derived by suffixation, prefixation, and
compounding, but there are still quite a few (45,000) which are basic, that
is, which cannot be derived from any other word.

The WFB alone contains a vocabulary larger than some estimates of the
vocabulary size of average high school seniors, who should presumably be
able to read any of the reading material for grades 3 through 9 without too
much difficulty.

In Table 5, estimates of the number of derived words in the population
are broken down according to relationship type (suffixation, prefixation,
compounding, and idiosyncratic relationships) and by degree of semantic
relatedness.

Insert Table 5 about here.

For some purposes we can divide the degrees of semantic relatedness into
two classes: SEM 0, SEM 1 and SEM 2 constitute those cases in which the
relationship is essentially transparent. A child could, given the meaning
of the base, figure out the meaning of the derived form, perhaps with some
help from context. SEM 3, SEM 4 and SEM 5, on the other hand, include
derived forms whose meanings are not completely predictable from the
meanings of their bases, so that they must in effect be learned as
separate items.

From Table 5 we see that there are an estimated 139,020 derived forms
in the population whose meanings are transparently related to the meanings
of their bases. This suggests strongly that knowledge of word-formation
processes opens up vast amounts of vocabulary to the reader. Conversely, a
reader who cannot take advantage of morphological relatedness among words
has in some sense more than twice as many words to deal with as the reader
who utilizes these relationships.
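
The "more than twice as many words" comparison can be made concrete with a
small sketch. It assumes, as a simplification, that the reader's burden is
just the count of items that must be learned separately; the count of
45,453 morphologically basic words is the population estimate reported
elsewhere in this paper, and the variable names are illustrative.

```python
# Population estimates quoted in the text.
basic_words = 45_453           # morphologically basic words
opaque_derived = 43_080        # SEM 3, SEM 4 and SEM 5 derived forms
transparent_derived = 139_020  # SEM 0, SEM 1 and SEM 2 derived forms

# A reader who uses morphology must learn basic and opaque forms separately.
with_morphology = basic_words + opaque_derived

# A reader who cannot use morphology must also learn every transparent form.
without_morphology = with_morphology + transparent_derived

ratio = without_morphology / with_morphology
print(with_morphology, without_morphology, round(ratio, 2))
```

Under these assumptions the ratio is a little over 2.5, consistent with the
statement in the text.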

There are also 43,080 derived forms that are relatively opaque
semantically. The majority of these, 26,599 words, are at the level SEM 3,
which means that although the meaning of the derived form is not completely
predictable from the meanings of its component parts, the meanings of the
component parts do in fact contribute something to the derived meaning.
Even in these cases, then, knowledge of word-formation processes will be
helpful to the reader trying to figure out the meaning of words in context.
On the other hand, the semantic opacity of these words is sufficient that
many readers, perhaps especially poor readers, will not be able to figure
out their meanings, and thus will have to learn them individually.

Table 6 gives the same type of information as Table 5, but computed on
a slightly different basis.

Insert Table 6 about here.

In Table 5, the degree of semantic relationship was based on familiar
meanings of derived words and their immediate ancestors. Table 6 is based
on the minimal semantic distance between derived words and their immediate
ancestors, that is, on the relationships between the most similar meanings
for each pair of words. For example, in Table 5, the relationship between
carry and carriage would be counted as relatively opaque, since only the
familiar meanings are taken into consideration. For the purposes of Table
6, on the other hand, the semantic relationship between these two words
would be counted as transparent, since the most similar meanings were
considered. Thus, Table 6 minimizes the number of derived forms that would
be considered opaque. Unless otherwise specified, we will use the figures
from Table 5 in our discussions of vocabulary composition.

The Number of Webster Main Entry Equivalents

Exactly how many words there are in printed school English depends on
the definition of "word" that is adopted. One way to get a meaningful
measure is to take as a definition of "word" the criteria for status as a
main entry in Webster's Third New International Dictionary, unabridged.
This dictionary is of special interest because it was used by Dupuy (1974)
as a basis for choosing a set of "basic words" to use in making estimates
of absolute vocabulary size. The number of "Webster main entry
equivalents" can be computed by including in our count of words the
following categories from our coding system (see Table 4 and Appendices A
and C): morphologically basic words, idiosyncratic morphological
relationships, suffixation, prefixation, compounding and contractions,
truncations, abbreviations, irregular inflections, irregular comparatives
and superlatives, alternate forms of words, semantically irregular plurals,
"scientific plurals," and derived proper names. The other categories in
Table 4 would be excluded from this count.

Calculated in this way, the numbers of "Webster main entry
equivalents" were as follows:

     Sample N          3,156
     Sample %          43.47
     Corpus N         37,707
     Population N    243,136
     Population %      39.88
How does this compare with the number of words in Webster's Third?
Dupuy (1974), on the basis of a very careful count, estimated the number of
main entries in Webster's Third to be 240,000. (This number excludes main
entries which were prefixes, suffixes, letters and other than first-listed
homographs, i.e., it includes only one main entry for each set of
homographic words.) However, this estimate is not directly comparable with
our estimates of "Webster main entry equivalents," for the following
reasons:

1. Our estimates of "Webster main entry equivalents" do not take into
account the fact that in Webster's Third, there are separate main entries
for regular inflections, comparatives, and superlatives that would fall
more than five inches away from their associated main entry in the physical
page layout. According to an estimate based on 10 randomly selected pages,
about 1.4% of the main entries in Webster's Third, or about 3,360 entries,
consist of such regular inflections, comparatives, and superlatives.

2. In Webster's Third, many suffixed forms, mostly in -ly and -ness,
are listed as subentries under their associated main entries. According to
our estimates, for every 100 entries, there are about 5.02 such subentries.
This would amount to 12,048 items in the whole dictionary.

3. Although Webster's Third excludes most proper names, it does
include some proper names that would have been coded as basic proper names
in our sample. According to Dupuy (1974), there are 23,900 proper names in
Webster's Third. On the basis of a small sampling (12 randomly selected
pages) we judge that about 31.25% of the proper names in Webster's Third
would have been coded as basic proper names in our coding system. This
amounts to 7,469 entries.

4. According to Dupuy's estimates, 29.2%, or 70,080, of the main
entries in Webster's Third are compound entries; that is, they consist of
two or more words separated by spaces, such as heat exhaustion. On the
other hand, the corpus of printed material used for the WFB was keypunched
in such a way as to exclude such items; with only a very few exceptions,
potential compound entries were divided into their component words.

If we exclude from the count of main entries in Webster's Third all
entries for regular inflections, comparatives and superlatives, and all
basic proper names and compound entries, and if we add to this count the
number of suffixed subentries, we have a figure which is directly
comparable to the number of "Webster main entry equivalents" in our
estimates for printed school English. The number of main entries in
Webster's Third, counted in this way, is 171,139. Thus, somewhat
surprisingly, it appears that there are more words in printed school
English than in an unabridged dictionary.
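
The adjustment described in the four points above is simple arithmetic,
and a short sketch makes the bookkeeping explicit. All percentages and
counts are taken from the text; the variable names are illustrative, and
rounding each component to a whole entry is an assumption about how the
reported figures were obtained.

```python
# Dupuy's (1974) count of main entries in Webster's Third.
MAIN_ENTRIES = 240_000

# 1. Regular inflections, comparatives, superlatives (about 1.4%).
regular_inflections = round(MAIN_ENTRIES * 0.014)

# 2. Suffixed subentries (about 5.02 per 100 main entries); these are added.
suffixed_subentries = round(MAIN_ENTRIES * 5.02 / 100)

# 3. Basic proper names (about 31.25% of 23,900 proper names).
basic_proper_names = round(23_900 * 0.3125)

# 4. Compound entries written with spaces (29.2%).
compound_entries = round(MAIN_ENTRIES * 0.292)

comparable_entries = (
    MAIN_ENTRIES
    - regular_inflections
    - basic_proper_names
    - compound_entries
    + suffixed_subentries
)

# Population estimate of "Webster main entry equivalents" from this study.
school_english_equivalents = 243_136

print(comparable_entries)
print(school_english_equivalents - comparable_entries)
```

On these figures the comparable dictionary count is 171,139, so the
estimate for printed school English exceeds it by roughly 72,000 items.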

One might wonder how this could be. Part of the answer lies in the
fact that books in these grade levels sample from a very broad range of
topics. Part of the explanation must also lie in the large number of
derived words in printed school English. As Table 5 shows, there are about
139,000 semantically transparent derived words, a little more than half of
which are compounds. Many of these derived forms, especially the
compounds, are low-frequency words coined for specific purposes or
contexts, and are not likely to be found in any dictionary. Examples of
such words would be essayist-poet, European-owned, ex-florist, and
everlengthening. The existence of large numbers of such words in school
texts makes knowledge of word-formation processes an important factor in
dealing with low-frequency words.
Dupuy's Estimate of the Number of Words in English

Dupuy (1974) undertook not only to construct a vocabulary test, but
also to make it a meaningful measure of absolute vocabulary size. Any
measure of absolute vocabulary size presupposes a definition of "word";
Dupuy chose to treat vocabulary size in terms of Basic Words, which are
defined in terms of the following criteria:

Dupuy took as his source of words Webster's Third New International
Dictionary, unabridged. Main entries in this dictionary are "basic words"
if they do not fall into any of the following excluded categories:

(1) compound and hyphenated entries,

(2) proper names,

(3) abbreviations,

(4) items which are not main entries in three other dictionaries: The
    Random House Dictionary of the English Language, The World Book
    Dictionary, and Funk and Wagnalls New Standard Dictionary of the
    English Language,

(5) items listed as foreign, archaic, slang or informal, or technical
    in the Random House Dictionary,

(6) "derived, variant, or redundant" words.

Dupuy estimated that there were 12,300 "basic words" in Webster's
Third, by applying these criteria to a representative 1% sample of this
dictionary. Using his 123 basic words (the 1% sample of 12,300) as a basis
for a vocabulary test, he has estimated vocabulary sizes at different grade
levels: 2,000 words in 3rd grade, 4,760 words in 7th grade, and over 7,000
words known by high school seniors.
Initial Comparison of Dupuy's Estimates with Ours. We have already
seen, in our estimate of Webster main entry equivalents, that the
vocabulary of printed school English is somewhat larger than Webster's
Third. (The subset of the vocabulary of printed school English that
actually occurs in the WFB is of course smaller, containing a little less
than one quarter of the words that are in the unabridged dictionary.) One
might expect, then, that the number of basic words in printed school
English would be a little larger than Dupuy's estimate, while the number of
basic words in the WFB should be substantially smaller.

To compare our estimates of vocabulary size with Dupuy's, we have to
determine what would be the closest equivalent in our coding system to
Dupuy's Basic Words. We will explore this question in more detail below; as
an initial basis for comparison, we would compare Dupuy's Basic Words with
our category of morphologically basic words. According to our analyses,
there are 10,108 morphologically basic words in the WFB, and 45,453 in the
population underlying that corpus.

Dupuy (1974) claims to exclude from Basic Words those derived words
which are redundant because their "meanings could be understood with
knowledge of the meaning of the word and affix." We would therefore add to
our count of basic words those derived words with the level of semantic
transparency SEM 3, SEM 4 or SEM 5. This would bring the number of basic
words in the WFB up to 16,655, and in the population, to 88,533.
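
A brief sketch shows how these totals relate to the other figures in the
paper. The numbers of opaque derived forms are obtained here by
subtraction, which is an inference from the reported totals rather than a
figure stated directly for the WFB; the dictionary keys and variable names
are illustrative.

```python
# Morphologically basic words, and basic words plus opaque derived forms
# (SEM 3, SEM 4, SEM 5), as reported for the WFB and for the population.
basic = {"WFB": 10_108, "population": 45_453}
basic_plus_opaque = {"WFB": 16_655, "population": 88_533}

# Opaque derived forms implied by the two totals.
implied_opaque = {key: basic_plus_opaque[key] - basic[key] for key in basic}

# Dupuy's estimate of the number of basic words in Webster's Third.
dupuy_basic_words = 12_300
ratio_to_dupuy = basic_plus_opaque["population"] / dupuy_basic_words

print(implied_opaque)
print(round(ratio_to_dupuy, 1))
```

The implied population figure matches the 43,080 relatively opaque derived
forms reported from Table 5, and the population total is roughly seven
times Dupuy's estimate.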

On the basis of this initial comparison, Dupuy's figures seem to be
underestimates by a substantial degree. His estimate of the number of
basic words might be in the ballpark, if it were supposed to reflect the
number of basic words a single child of average ability might encounter in
school reading material in grades 3 through 9. His sample of basic words
was intended, however, to be representative of the entire English
vocabulary as represented by Webster's Third New International Dictionary,
unabridged. This would lead one to expect that the number of basic words
would be somewhat similar to the number we estimated for printed school
English.

Sources of the Differences between Dupuy's Estimate and Ours. Having
established that Dupuy's estimate of the number of basic words in English
is much smaller than would be expected on the basis of our analysis of the
words in the Word Frequency Book, we would like to ascertain as closely as
possible the reasons for the difference. There are two major possible
sources of difference: (a) differences in the corpora used in defining the
population of words, and (b) differences in the definition of what
constitutes a basic word. It is clear already that factor (a) is not the
problem, since the vocabulary of printed school English is slightly larger
than Webster's Third. The disagreement between our estimates and Dupuy's
must, therefore, lie mostly in the criteria adopted for "Basic Words."

First of all, we want to determine what are the differences between
our coding category "morphologically basic words" and his category of Basic
Words. To do this, we will look at some of Dupuy's criteria in detail,
and, in this process, estimate how many words might be added to Dupuy's
estimate if his criteria were adjusted in the directions we will suggest.
Dupuy excludes from his category of basic words certain categories of
words that would be included among our "morphologically basic words."
Specifically, he excludes items that were not main entries in the four
dictionaries he used, and items that were classed as technical, foreign,
slang, or archaic in the Random House dictionary.



The first of these categories seems to contain the largest number of
words: an estimated 97,900 main entries in Webster's Third are excluded
because they did not appear as main entries in the other three
dictionaries. A substantial number of these would also have been excluded
on the basis of other criteria as well; for example, around half of the
items in the list (e.g., abruptly, academician, acknowledgeable) would have
been excluded as semantically transparent derivatives.

The motivation for excluding such items is clear, and seems
legitimate: A list of the basic words in English should include words that
really are English words; and one might assume that any item that is really
a word in English would in fact show up in any substantial dictionary. But
there are some problems with this principle of exclusion. First, any
dictionary (besides the OED, anyway) necessarily excludes large numbers of
possible entries, and one cannot assume that the editors' criteria,
whatever they may have been, were appropriate for the purpose for which the
list of basic words is being compiled.

Second, even a consensus among dictionaries cannot tell us what words
actually do occur in the materials children read in school. On the other
hand, the American Heritage Intermediate Corpus was carefully selected to
be representative of printed materials used in schools in grades three
through nine, and gives us a solid basis for an operational definition of
what is a word in "printed school English."

Among the words excluded because they were not main entries in all
four dictionaries were an estimated 291 that were morphologically basic (in
the sense that they could not be analysed into free or recognizable bound
stems). (This estimate is based on an analysis of one-third of the 979
items in this category.) Another estimated 238 items in this group were
morphologically, but not semantically, analysable, for example,
asthenobiosis, clasmatocyte, hangbird, moosewood. Thus, there could be as
many as 500 items among these words that might be counted as basic words
under somewhat more liberal criteria. If even a quarter of these were
actually counted as basic words, it would double the size of Dupuy's
original estimate.
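
The doubling claim follows from comparing these counts with the 123 basic
words in Dupuy's 1% sample. The sketch below works through that
arithmetic; it assumes that the 291 and 238 items are counts within the 1%
sample (as the reference to 979 sampled items suggests) and uses the
rounded figure of 500 from the text. Variable names are illustrative.

```python
SCALE = 100                # a 1% sample is scaled up by a factor of 100
dupuy_sample_basic = 123   # basic words in Dupuy's 1% sample

morphologically_basic = 291     # estimated, excluded by the four-dictionary rule
semantically_unanalysable = 238 # estimated, same exclusion

candidates = morphologically_basic + semantically_unanalysable
rounded_candidates = 500   # "as many as 500" in the text

# Suppose only a quarter of the candidates were accepted as basic words.
added_sample_words = rounded_candidates // 4

original_estimate = dupuy_sample_basic * SCALE
revised_estimate = (dupuy_sample_basic + added_sample_words) * SCALE

print(candidates, added_sample_words)
print(original_estimate, revised_estimate)
```

A quarter of 500 is 125 sample words, slightly more than the 123 already
counted, so the scaled-up estimate rises from 12,300 to 24,800.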

Finally, there are some words among those excluded as technical which
seem to be part of general vocabulary: coda, creosol, formaldehyde,
herpes, holmium, methyl, orthogonal, and placebo. These 8 words, since
they are part of a 1% sample, would add another 800 words to Dupuy's
estimate if they were included.

Compound and Hyphenated Entries. Both the WFB and Dupuy exclude all
compound entries, that is, items consisting of two or more words separated
by spaces. In the case of the WFB, this was due to the methods of
keypunching adopted; with only a very few exceptions, words separated by
spaces were entered as separate words. (The exceptions were a few compound
names such as New York that were incorrectly punched as single items (that
is, as NewYork) instead of as separate words.) In the case of Dupuy's
analysis, compound entries, although included as main entries in Webster's
Third, were excluded from the count of basic words. However, Dupuy also
automatically excluded all hyphenated entries, whatever their nature. Our
analysis, on the other hand, treats hyphenated entries as it would
compounds (that is, compounds not separated by spaces) or affixed forms.
Any such form is individually coded in terms of its semantic transparency.
In our estimate of vocabulary size, we would want to include any complex
form, hyphenated or not, which would be coded as SEM 3, SEM 4, or SEM 5,
that is, which was semantically opaque to the extent that it would have to
be learned separately, since its meanings could not be inferred from the
meanings of the component parts.

Therefore, in applying our coding system to Dupuy's corpus of words,
we want to determine how many of the hyphenated forms excluded by Dupuy are
semantically opaque. Of the 775 compound and hyphenated entries excluded
from the list of basic words by Dupuy, only 77 are hyphenated. Of these,
we would consider at least 22 to be semantically opaque to the extent that
they would have to be learned as separate items. These 22 are:

     all-fired            cab-over            cap-and-ball
     charge-a-plate       chaff-flower        clip-clop
     cross-staff          crinkum-crankum     cuckoo-bread
     dew-drink            double-talk         dove's-foot
     down-and-out         games-all           hokus-pokus
     jack-by-the-hedge    last-ditch          man-about-town
     poker-faced          rip-rap             small-beer
     whing-ding
To the extent that these do in fact represent items that would have to be
learned separately, because their meanings are not inferrable from the
meanings of their parts, we would have to add this number of items to
Dupuy's estimate of absolute vocabulary size to bring it in line with our
criteria. Since Dupuy's estimate is based on a one-percent sample, this
means adding 2,200 words to his original estimate of vocabulary size.
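
Each of these adjustments rests on the same step: a count observed in
Dupuy's one-percent sample is scaled up to the whole dictionary. A minimal
sketch of that step follows; the function name and the dictionary of
labels are illustrative, while the counts are those given in the text.

```python
SAMPLE_PERCENT = 1  # Dupuy's sample covered 1% of Webster's Third


def scale_to_dictionary(sample_count: int) -> int:
    """Scale a count from the 1% sample up to the full dictionary."""
    return sample_count * 100 // SAMPLE_PERCENT


sample_counts = {
    "basic words (Dupuy's criteria)": 123,
    "entries missing from the other dictionaries": 979,
    "opaque hyphenated entries": 22,
    "technical words in general use": 8,
}

for label, count in sample_counts.items():
    print(f"{label}: {count} in sample, {scale_to_dictionary(count):,} overall")
```

Applied to the 22 opaque hyphenated entries this gives the 2,200 words
mentioned above; the same step yields 800 for the 8 technical words.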

Derived, Variant, or Redundant Words. We will continue the comparison
of vocabulary size estimates by reviewing the criteria used to exclude from
the class of basic words those considered to be "derived, variant, or
redundant." In addition to examining the criteria, we will present a
reanalysis of the 184 words listed by Dupuy in the "derived, variant, or
redundant" category. Dupuy uses the following criteria:

    A main entry was considered a derived or variant word form if
    in any of the four dictionaries

    1. The definition mentioned or referred back to another form
    of the same word (e.g., beck: a beckoning gesture) or was simply a
    different tense form (e.g., supposed: suppose).

    2. The definition was simply a different spelling (e.g.,
    calimanco: calamanco).

    3. The definition was a different word which provided a
    fuller definition (e.g., boxberry: the checkerberry).

    4. The entry was a combination of two or more words and the
    definition included a reference to one or more of the words (e.g.,
    bookkeeper: one who keeps account books).

    5. The entry word was a derived form with a base word and
    affix whose meaning could be understood with knowledge of the
    meaning of the word and affix (e.g., adiabatic: not diabatic).

For each of these criteria, there are cases in which words will be
excluded from the count of basic words which would in fact have to be
learned as separate items in the process of vocabulary acquisition.

In the case of criterion 1, there are cases where a different tense
form may in fact have meanings divergent enough from its stem so that this
meaning would not be easily inferred. For example, striking, imposing,
blooming, collected, elevated, and hearing all have meanings which are
quite distinct from the meanings of their stems.

In the case of criterion 2, it would in general seem right to count as
"the same word" variants that differ only in details of spelling.
However, there are also cases of variation in spelling, for example draught
and draft, which are substantial enough to pose real problems to a reader
who is familiar with one variant and not the other.

Criterion 3 is probably the most questionable of all, from the
perspective of the reader or child learning vocabulary. A reader
encountering the word milfoil in a text, until he or she turns to the
dictionary, is presumably not aided by the fact that this word can be
defined simply in terms of another word, yarrow. In fact, if the reader
does turn to the dictionary, this type of definition is likely to pose an
additional obstacle, if, as is often the case, the word in the definition
is as obscure as is the word defined.

Criterion 4 is appropriate if it is applied to words whose meanings
can in fact be understood from the meanings of their component parts. In
practice, however, Dupuy has used it to exclude from his count of basic
words items whose meanings are not all that transparent: fiddlewood,
flapdragon, howbeit, leapfrog, seismoscope, silviculture, and threadfin.

Criterion 5, like criterion 4, is appropriate only if the compound
item has a meaning that is truly predictable from the meanings of its
component parts. Dupuy includes as derived words the following, whose
meanings are either not fully predictable on the basis of their component
parts, or which rely on relatively rare meanings of their components:
chanceful, clamper, coloratura, conquistador, defrock, episcopalism,
extravaganza, gymnasiast, provisional, rarefy, and valedictorian.

Applying Our Coding Criteria to Dupuy's Derived, Variant, or Redundant
Words. Dupuy lists 184 words as derived, variant, or redundant. We applied
our coding system to these words to see how many of these words would be
considered redundant in terms of our criteria for grouping words.

First of all, five of the words that Dupuy lists as belonging to this
category we were not able to find in Webster's Third New International
Dictionary, unabridged, the source of all of Dupuy's words: dashen,
deconate, padodite, Rayraceous, and tragedion. We assume that these are
due to misprints in the published version of his list; we further assumed
that dashen was supposed to be dasheen, and tragedion was a misspelling of
tragedian. Otherwise we did not find likely sources in the dictionary for
these apparent errors. This leaves us with 181 words to classify.
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Oi? the remaining words, three appeared to be cases of criterion 3, 
that is, words defined in terms of other words: dasheen (= taro) , milfoil 

(= yarrow), and diesis (= double dagger). As mentioned above, we would not 

\ 

consider these words to be redundant from the point of view of a reader 
trying to understand a text, or a child learning vocabulary* 

Twelve items from the 181 seem to be alternate spellings (although a
few might also be treated as meeting Criterion 3). Listed with their
alternate spellings, these are:

     bressummer          breastsummer
     cullender           colander
     draught             draft
     ebon                ebony
     floatage            flotage
     further             farther
     hagberry            hackberry
     insphere            ensphere
     jetton              jeton
     koorajong           kurrajong
     mediaeval           medieval
     proa                prau

Conservatively, draught, and perhaps also proa, are distinct enough in
spelling from their alternate forms to present some difficulty to a reader
who knew only one form of the word.

The remaining 166 words were coded in terms of the transparency of the
semantic relationship between the word and its component parts, according
to the same system used in our coding of the sample from the Word Frequency
Book.

Defining "semantically opaque" as SEM 3, SEM 4 or SEM 5, there are 43
items among the 184 coded which would be counted as semantically opaque.
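
The bookkeeping behind these counts can be verified with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are ours, and the reading that three of the five unlocated words were set aside (the other two being recovered as dasheen and tragedian) is our inference from the stated total of 181.

```python
# Tally of Dupuy's 184 "derived, variant, or redundant" words,
# using only counts reported in the text.
listed = 184
unlocated = 5              # not found in Webster's Third
recovered = 2              # dashen -> dasheen, tragedion -> tragedian (assumed misprints)
classifiable = listed - (unlocated - recovered)

criterion_3 = 3            # dasheen, milfoil, diesis
alternate_spellings = 12
coded_for_transparency = classifiable - criterion_3 - alternate_spellings

print(classifiable)            # 181
print(coded_for_transparency)  # 166
```

The three groups (criterion 3 items, alternate spellings, and words coded for semantic transparency) thus account for all 181 classifiable words.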




In contrasting our criteria with Dupuy's, and applying our criteria to
his list of words, we have come up with the following additions to his
original set of basic words:

     8 words listed in the Random House dictionary as "technical"
     which we would consider part of general vocabulary.

     291 (estimated) morphologically basic words among those Dupuy
     excluded because they did not occur as main entries in all four of the
     dictionaries he used.

     238 (estimated) words among those excluded because they did not
     occur in all four dictionaries, which were morphologically complex,
     but semantically opaque.

     22 semantically opaque hyphenated entries.

     3 items counted as "redundant" by Dupuy (dasheen, milfoil, and
     diesis) which we feel would have to be learned as separate items.

     2 difficult spellings (draught and proa) so different from their
     alternative forms that they would presumably require separate
     learning.

     43 words counted as redundant by Dupuy, which we consider to be
     semantically opaque.

This adds up to a total of 607 additional words beyond the 123 already
counted as basic by Dupuy. This would bring the total number of basic
words in Webster's Third up to 73,000. This figure is much closer to our
estimate of basic words in printed school English (88,533); although it is
still a little lower than our figure, it is almost six times as great as
Dupuy's original estimate of the number of basic words.
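
The arithmetic behind this revision can be laid out explicitly. The following is a minimal sketch using the itemized counts above; the dictionary keys are our own shorthand, and the scale factor of 100 from sample to dictionary is an assumption inferred from the step from 730 sample words to 73,000, not a figure stated in this passage.

```python
# Additions to Dupuy's basic words, as itemized above (counts within his sample).
additions = {
    "technical words treated as general vocabulary": 8,
    "morphologically basic, not in all four dictionaries": 291,
    "complex but opaque, not in all four dictionaries": 238,
    "semantically opaque hyphenated entries": 22,
    "defined by synonym (dasheen, milfoil, diesis)": 3,
    "difficult spellings (draught, proa)": 2,
    "'redundant' words coded as semantically opaque": 43,
}
dupuy_basic_in_sample = 123

added = sum(additions.values())
revised_in_sample = dupuy_basic_in_sample + added

# ASSUMPTION: sample-to-dictionary scale factor, inferred from 730 -> 73,000.
scale = 100
revised_total = revised_in_sample * scale
dupuy_total = dupuy_basic_in_sample * scale

print(added)                                  # 607
print(revised_total)                          # 73000
print(round(revised_total / dupuy_total, 2))  # 5.93
```

Under that assumed scale factor, the ratio of 5.93 corresponds to the "almost six times" cited in the text.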



The bulk of the difference between Dupuy's original estimate and our
figures seems to be traceable to two main factors. First, Dupuy's use of
four dictionaries excludes a large number of words (most of them rather low
in frequency, to be sure) which we would include. Second, he clearly sets a
different cut-off point with respect to which words are to be counted as
semantically redundant. He seems to place a much greater weight on
morphological relatedness, and considers as redundant words which we would
consider to have only rather distant semantic relationships.

In summary, we might say that Dupuy has adopted a prescriptive rather
than descriptive concept of what constitutes a basic word in English, and
that his estimates do not at all reflect the diversity of vocabulary
encountered by children in reading school texts.

Seashore and Eckerson's Estimate 

Like Dupuy (1974), Seashore and Eckerson (1940) attempted to construct
a test which would measure not only relative vocabulary knowledge, but also
give an indication of the absolute size of a person's vocabulary. They
also used the method of selecting a random sample of items in an unabridged
dictionary. We want to contrast our estimates of vocabulary size with
theirs first, because their study has served as a basis for much subsequent
research in vocabulary size, and second, because it has been subject to
careful scrutiny by Lorge and Chall (1963).



Seashore and Eckerson took as their population of words the entries in
Funk and Wagnalls' New Standard Dictionary of the English Language, the
two-volume edition of 1937. This dictionary was chosen because it was large
enough to represent the full range of adult vocabulary without including
extremely rare words. Also, it contains all words in a single alphabetical
order, making it easier to construct a subsample for testing.

This dictionary contains two types of entries: "basic" words, or main
entries, printed in heavier type and next to the left margin, and
"derivative" terms, which are indented under the basic term. Seashore and
Eckerson estimated that the dictionary contains 166,247 "basic" words, and
an additional 204,018 "derivative" words, excluding multiple meanings and
variants in spelling.

To some extent, the distinction between basic and derived entries can
be stated in terms of word formation processes. That is, derivative
entries are words derived from their basic entries by suffixation or
compounding. Seashore and Eckerson give the example of the basic word
loyal and its derivatives Loyal Legion, loyalism, loyalize, and loyally.
However, not all words derived by compounding or suffixation are listed as
derivatives; many such items are basic words. For example, master,
masterful, masterhood, masterless, masterly, masterpiece, mastership,
mastersinger, masterwork, and mastery are all basic words, that is, main
entries in Funk and Wagnalls' dictionary. Furthermore, prefixed forms,
because they occur elsewhere in an alphabetic list, also constitute
separate main entries.

The criteria for placement of an item as a main or derivative entry
are not explicitly given in the dictionary. The principles followed seem
to be approximately these: First, compound entries (that is, entries with
internal spaces) are treated as derived entries, except in the case of a
few which are also proper names. Second, suffixed items whose meaning is
predictable from that of the basic word with little or no additional
definition are usually treated as derived entries. This includes most
adverbs in -ly, nominalizations with -ness, and many other adjectival
forms. For the remaining suffixed items and compounds, which could be
listed either as basic or derivative, one of the criteria for placement
seems to be some notion of "importance." For example, iceboat and
icebreaker are basic entries, while icecliff, icefoot, icequake, and others
are listed as derivatives. "Importance" seems to correspond pretty closely
to frequency.

In some cases, alphabetical order and the arrangement of words seem to
play a role. For example, under the basic item Eurystomata are listed the
derived words eurystomatous, eurystoman, eurystomous, eurystome,
eurythermal, and eurythermic. Were t to precede s in the alphabet, it
seems likely that eurythermal would have been the basic word, and
Eurystomata one of the derivative items. The principle followed here seems
to be that if a number of relatively rare or unimportant compounds occur in
succession, the first is given as a main entry, and the following as
derivatives. This also seems to be the case, for example, when under the
basic word meteoromancy are listed the derivative items meteorometer,
meteoroscope, and meteoroscopy.

A slight further complication is that some compounds are listed as
derived items, and also as main entries, with the main entry referring to
the definition given for the derived item.

In many cases, derived items are redundant, or semantically
transparent. That is, if one knows, for example, the meaning of the basic
item evangelical, the meaning of the derivative evangelicalism is likely to
be self-evident. On the other hand, a substantial proportion of the
derivative entries in Funk and Wagnalls may not be so semantically
transparent. For example, knowing the meaning of stay does not guarantee
that one will be able to figure out the meaning of stayplow (a type of
plant, also called restharrow).

It cannot be assumed that all basic entries are semantically distinct,
either. For example, one might consider the meaning of gusty as rather
obvious, given the meaning of the word gust. Similarly, evaporate,
evaporation, and evaporator are listed as distinct basic entries, despite
their clear semantic relatedness.

Thus, it is not clear exactly how Seashore and Eckerson's estimates of
vocabulary size should be interpreted. The figure of 166,247 basic words
and 204,018 derived words, totalling 370,265 words, reflects the make-up of
an unabridged dictionary, but cannot be directly interpreted in terms of
any particular theory of words and how they are learned.

Lorge and Chall's Critique of Seashore and Eckerson. Lorge and Chall
(1963) have critically examined the work of Seashore and Eckerson, and
noted several weaknesses. One relates to the problem of space sampling.
The method used to obtain a sample of words from the dictionary (taking the
third basic word in the first column of every left-hand page in the
dictionary) turns out to produce a sample that is biased in that it
contains disproportionately many common or easy words. This makes the
vocabulary test based on this sample easier, and hence leads to an
overestimation of the vocabulary size of the person taking the test.

Lorge and Chall also noted some errors or inconsistencies in counting.
For example, Seashore and Eckerson claimed not to count duplicate spellings
in their count of basic words, but Lorge and Chall found that 2% of the
basic words in their initial estimate of vocabulary size were in fact
duplicate spellings. Another inconsistency relates to homographs. Lorge
and Chall argue that since Seashore and Eckerson take as a criterion of
word knowledge recognition of any common meaning of a word, they should not
count homographs as separate items. However, homographs (counted as
distinct items) amounted to 9% of the basic words in Seashore and
Eckerson's estimates.

More importantly, Lorge and Chall disagree with Seashore and Eckerson
as to what should be counted in an estimate of vocabulary size. They
suggest excluding the following categories of items, which amount to an
estimated 30% of the entries in Funk and Wagnalls: names of persons,
Biblical names, other names (mythical, races, etc.), names of flora and
fauna, geographical place-names, abbreviations, suffixes, prefixes, and
combining forms. Taking all these adjustments into account, Seashore and
Eckerson's estimate of 166,000 basic words is reduced by about 40%, to
99,600.

Comparison with Our Estimate. How many words are in printed school
English if one adopts the criteria from Seashore and Eckerson (1940)? To
compute the number of "basic words" by their definition, we can start with
our number of "Webster main entry equivalents," and make the following
adjustments: First, all but the most common compounds would be excluded,
since they would be derived entries in Funk and Wagnalls. Also excluded
would be all semantically transparent suffixed forms. On the other hand,
we would have to add to our estimate basic proper names and capitalizations
homographic with proper names, since these would be main entries in Funk
and Wagnalls. (To come up with an estimate based on Lorge and Chall's
(1963) revision of the criteria for "basic words," we would exclude these
last two categories.) The number corresponding to Seashore and Eckerson's
"total words" would be the number of "Webster main entry equivalents,"
including all derived and compound forms, plus basic proper names and
capitalizations homographic with proper names.
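
Two of the figures cited above for Seashore and Eckerson's dictionary counts can be checked directly. This is a small sketch using only numbers quoted in the text (the variable names are ours): the sum of basic and derivative entries, and the size of the reduction implied by Lorge and Chall's revised figure.

```python
# Seashore and Eckerson's entry counts for the Funk and Wagnalls dictionary.
basic = 166_247
derivative = 204_018
print(basic + derivative)  # 370265

# Lorge and Chall's adjustment, starting from the rounded basic-word figure.
rounded_basic = 166_000
lorge_chall_basic = 99_600
reduction = 1 - lorge_chall_basic / rounded_basic
print(round(reduction, 2))  # 0.4
```

The "about 40%" reduction is thus exactly 40% of the rounded figure of 166,000.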

Table 7 compares Seashore and Eckerson's (1940) estimates of the
number of words in English with the results of applying comparable
definitions of "word" to the WFB and the underlying population of words.
This table also includes estimates of the number of main entries and "basic
words" in Webster's Third by Dupuy (1974) and the results of applying
somewhat similar definitions of "word" to the data in the WFB.

Insert Table 7 about here.

It is interesting to note that in every case but that of Dupuy's
"basic words," the authors' original estimates are rather close to the
figures derived by applying comparable criteria to the population of words
in printed school English. This is an indication that the three sources of
vocabulary, namely printed school English as sampled in the WFB, Webster's
Third (unabridged), and the Funk and Wagnalls dictionary used by Seashore
and Eckerson (1940), are all of approximately the same size, especially when
adjustments are made for the fact that Webster's Third, unlike the other
two sources, includes only a restricted range of proper names, and for the
fact that the WFB, unlike the two dictionaries, does not have separate
entries for compound items. The differences between the columns in Table 7
are therefore due largely to differences in the definitions of "word" or
"basic word" that were adopted. Had the authors been able to agree on
these definitions, there would have been fairly close agreement as to the
total number of words in English.

How Many Words Are There In English?

In the estimates of total number of words in English we have just been
comparing (based on large unabridged dictionaries and a statistical
projection to the total vocabulary of printed school English), the major
difference between the magnitudes has been due to disagreements about
criteria used for counting. To answer the question "How many words are
there in English?" one has to determine what is the appropriate definition
of "word" to use.

We feel that the best way to approach the counting of words is in
terms of distinct word families, where a "word family" is a group of
morphologically related words such that if a person knows one member of the
family, he or she will probably be able to figure out the meaning of any
other member upon encountering it in text, with information from context
that would be available for most occurrences of that word.

Counting as distinct word families all morphologically basic words and
semantically opaque (SEM 3, SEM 4 and SEM 5) derived words, we have
estimated that there are 88,533 distinct word families in printed school
English. However, some substantial qualifications must be made before this
number can be correctly interpreted.

First of all, how words are to be counted depends on why you are
counting them. Our interest in estimating the number of words in printed
school English is to determine the size and nature of the task that
children face in learning the vocabulary of school texts. Whether we
should count understand and misunderstand as one word or two depends on how
children actually deal with them. If children who know the meaning of
understand can learn the word misunderstand, or interpret it in context,
with little or no additional effort, then we would want to count these two
words as being members of a single word family.

Therefore, any criterion for counting words must be relative to some
level of morphological knowledge. For this reason, a truly meaningful
estimate of the number of words in printed school English will require
empirical studies of children's knowledge of morphology. Our system of
coding different degrees of semantic relatedness is an attempt to
approximate what we believe the results of such studies would be; but it
remains speculative until these coding categories can be tied to particular
age and ability levels.

Our estimate of 88,533 distinct word families assumes that children in
grades 3 through 9 would not be helped much by morphological relatedness
among words if the degree of semantic relatedness were SEM 3, SEM 4 or SEM
5. For example, knowing the meanings of hook and worm would not provide
sufficient information for the child to guess the full meaning of hookworm
unless the context were rich enough to give unmistakable clues for the
remaining semantic components (e.g. parasitic, causing disease).
Therefore, hookworm and similar derived forms were counted as constituting
separate word families. However, if we could somehow establish that ninth
graders were able to make use of SEM 3 relationships in learning or
interpreting new word meanings, our estimate of the number of distinct word
families for ninth graders would have to be reduced to 61,934. Conversely,
if we were to find that children at a certain grade level were less adept
than we expected at seeing and utilizing relationships among words, our
estimate of the number of distinct word families for children at that grade
level would have to be revised upwards.
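
The dependence of the family count on the assumed level of morphological knowledge can be made concrete with a small sketch. The `Word` record, the `count_families` function, and the toy vocabulary with its SEM codes are all hypothetical illustrations of the counting rule described above, not our actual coding data; only the two totals at the end are taken from the text.

```python
from typing import NamedTuple, Optional

class Word(NamedTuple):
    form: str
    base: Optional[str]  # None marks a morphologically basic word
    sem: int             # SEM code relative to the base (0 for basic words)

def count_families(words, opaque_from):
    """Count every basic word, plus every derived word whose SEM code is at
    or above `opaque_from` (too opaque to be inferred from its base)."""
    return sum(1 for w in words if w.base is None or w.sem >= opaque_from)

# Hypothetical toy vocabulary; the SEM codes are illustrative guesses.
toy = [
    Word("understand", None, 0),
    Word("misunderstand", "understand", 1),
    Word("hook", None, 0),
    Word("worm", None, 0),
    Word("hookworm", "hook", 3),
]
print(count_families(toy, opaque_from=3))  # 4: hookworm is its own family
print(count_families(toy, opaque_from=4))  # 3: SEM 3 forms are now inferable

# Reported estimates under the two assumptions.
families_sem3_opaque = 88_533
families_sem3_transparent = 61_934
print(families_sem3_opaque - families_sem3_transparent)  # 26599
```

On this reading, the difference of 26,599 is the estimated number of SEM 3 derived words that would cease to count as separate families if readers could exploit SEM 3 relationships.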

Other Categories of Nonredundant Words. Another way to talk about
word families is in terms of redundant versus nonredundant words. If a
child who knows the meaning of estimate can automatically interpret or
learn overestimate, the latter word is redundant; it does not contribute to
the child's vocabulary learning task, or add to the vocabulary load of a
text the child is reading. Our figure for the total number of distinct
word families is supposed to reflect the number of nonredundant words in
printed school English. However, there may be several types of words not
included in this count which also should probably be counted as
nonredundant in terms of the effort they would require to learn or
interpret.

For example, abbreviations were not included in our count of distinct
word families, because they do not constitute distinct words in the
prototypical sense. One might consider them to be redundant in that an
abbreviation has the same meaning as the word for which it stands.
However, the relationship of an abbreviation to its unabbreviated form, and
hence its meaning, is not at all obvious in all cases; most often, an
abbreviation must be learned as a separate item.

\ " . - 

On similar grounds, one might want to include in the count of distinct
word families other categories in our coding system such as truncations,
irregular inflections, irregular comparatives and superlatives, some
alternate forms of words, and semantically irregular plurals. For each
category, it could be argued that many or most of the items were not
redundant, that is, that knowledge of other, related forms would not
guarantee the reader a fair chance of understanding that item when
encountering it the first time in reading.

All the categories just mentioned would add only an estimated 4,935
words to the population, bringing our total vocabulary estimate up to
93,468 distinct word families. However, if we want to estimate the total
number of words in printed school English in terms of nonredundant items to
be learned, several other categories of items might be added which would
increase this overall figure substantially.

Proper Names. Both Dupuy (1974) and Lorge and Chall (1963) exclude
proper names from their count of basic words. This exclusion is presumably
based on the fact that proper names are functionally distinct from other
vocabulary items in a number of ways. In some theories of meaning, for
example, it is argued that proper names have reference, but no meaning,
unlike common nouns which can have both reference and meaning. In the
context of reading, it might be argued that a child only has to recognize a
proper name as being such, and that any information about the individual
associated with that name will either be supplied in the story itself, or
should be considered knowledge about the world, and not vocabulary
knowledge as such.

This is a complex issue, more so than we can do justice to within the
scope of this paper. One could argue, however, that there is at least a
subset of proper names that should be counted as part of general vocabulary.
Certainly, the names of characters are usually assigned a referent within
the context of a story, so that the reader often needs little, if any,
prior knowledge about that name to successfully comprehend the text. But
there are some proper names which are most often not explained within
texts, and which the reader must be familiar with in order to properly
understand the text. This is certainly true of many familiar geographical
place names. Lack of knowledge of the reference of words such as
Washington, Florida, Alaska, or Panama could contribute to comprehension
failure in exactly the same way that ignorance of the meaning of other
words in the text might. Thus there is at least a subset of proper names
which on practical grounds might be considered as an integral part of a
person's vocabulary knowledge.

A related point is that the line between proper names and other areas
of vocabulary (for example, names of flora and fauna, or technical terms)
is not clearly defined. For example, eagle is counted by Dupuy as a basic
word, but Megaloceros as a proper name. There are differences between
these two words, in terms of usage and frequency, but it isn't clear that
these differences bear directly on the classification of an item as a
common or proper noun.

Determining which or how many proper names should be included in an
estimate of vocabulary size would require some more detailed work on the
role of proper names in reading comprehension. A rough estimate, however,
was made in the following fashion: Of the 929 morphologically basic proper
names in our sample, a count was made of those which intuitively seemed to
be "important," that is, knowledge of them would be likely to be assumed in
at least a large proportion of school texts. Eighty proper names met this
criterion. A second count, of those proper names that were listed in the
American Heritage School Dictionary, gave the same result. It would seem
reasonable to assume that those proper names which were necessary for
understanding school texts would be defined in this dictionary, and vice
versa.

Since there are eighty proper names in our sample whose meanings would
probably be assumed in most school texts, there would be about 956 such
names in the WFB. Assuming that important proper names are relatively high
frequency words, there would be perhaps 1,000 such names in the population,
and possibly several times as many. Especially in the higher grades, one
would expect that an increasing number of proper names would be assumed
rather than explained in school texts, and thus should be counted as part
of the demands on the child's vocabulary knowledge.
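
The projection from the sample to the WFB is a simple proportional scaling. The sketch below reproduces it; the sample size of 7,260 words is given in the text, but the figure of 86,741 distinct word types in the Word Frequency Book is an assumption taken from the published description of that corpus, not a number stated in this passage.

```python
sample_size = 7_260        # words in our sample of the WFB
important_in_sample = 80   # proper names judged "important"

# ASSUMPTION: number of distinct word types in the Word Frequency Book.
wfb_types = 86_741

estimate = important_in_sample * wfb_types / sample_size
print(round(estimate))  # 956
```

The same scale factor (roughly 11.9 WFB types per sampled word) would apply to any other category counted in the sample.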

Homographs. Most estimates of vocabulary size, and all of those we
have been discussing, lump together all homographs. But a child who knows
only the noun bear (= animal), when confronted with the verb bear (= carry)
in a text for the first time, is encountering a brand new word. Knowledge
of the one meaning of bear is no help in figuring out the new meaning. In
fact it is probably a hindrance. For this reason, if an estimate of
vocabulary size attempts to reflect the number of nonredundant items a
child would have to learn, it would have to count distinct meanings of
homonyms as separate items. Even related, but somewhat different, meanings
of the "same word" may present difficulties to young readers.

An estimate of the extent of homophony in printed school English was
made by counting the number of distinct meanings for a random sample of 156
of the morphologically basic words identified in our 7,260-word sample of
the words in the Word Frequency Book. The primary dictionary used for
determining number of meanings was the American Heritage School Dictionary.
Since this dictionary was based on the American Heritage Intermediate
Corpus, which also formed the basis for the WFB, it should reflect the
number of meanings actually occurring for a given item in that corpus. For
words which did not appear in this dictionary, we used Webster's Third New
International, unabridged. This introduces a potentially confounding
factor, since an unabridged dictionary would be likely to include a larger
number of meanings for any given item. However, for each item, a code was
used to represent which dictionary was used to determine the number of
meanings, so that this could be taken into account in statistical analyses.
Morphologically basic words appearing in neither of these two dictionaries
were assumed to have only one meaning.
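
The lookup procedure just described amounts to a fallback chain with a source code attached to each count. The following is a minimal sketch of that logic; the function name, the source labels, and the two miniature "dictionaries" are hypothetical stand-ins, not the actual coding materials.

```python
def count_meanings(word, school_dict, unabridged_dict):
    """Return (number_of_meanings, source_code) for a word.

    The school dictionary is consulted first; the unabridged dictionary is
    used only as a fallback; words found in neither get one meaning.
    """
    if word in school_dict:
        return len(school_dict[word]), "AHSD"
    if word in unabridged_dict:
        return len(unabridged_dict[word]), "W3"
    return 1, "NONE"

# Hypothetical miniature dictionaries mapping words to lists of meanings.
school = {"bear": ["animal", "carry"]}
unabridged = {"bear": ["animal", "carry", "endure"], "proa": ["sailing boat"]}

print(count_meanings("bear", school, unabridged))  # (2, 'AHSD')
print(count_meanings("proa", school, unabridged))  # (1, 'W3')
print(count_meanings("zzyx", school, unabridged))  # (1, 'NONE')
```

Recording the source alongside each count is what allows the larger meaning counts expected from the unabridged dictionary to be controlled for in later analyses.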

The number of distinct meanings for each word was counted at each of
five levels of semantic distinctness, defined in terms of the levels of
semantic distance between meanings used in our coding system. One example
should make the relationship between the two codes clear: Two meanings are
counted as distinct at level SEM 2 if the distance between them is greater
than SEM 2 in terms of our original coding system. Two meanings are
collapsed (counted as nondistinct) if they are related at a level of SEM 2
or lower.

The end points of our scale are defined as follows: At level SEM 0,
any variations in meaning listed in the dictionary, however minor, were
counted as distinct, along with any meanings for subentries such as other
parts of speech, idioms, and phrases. At level SEM 4 two meanings were
counted as distinct only if there was no relationship at all between them
that would be of any use in learning or remembering the two meanings.

In addition to these five levels, for each word we also encoded the
number of homographs, as numbered with superscripts in the American
Heritage School Dictionary, or the number of etymologically distinct
sources in Webster's Third. A seventh number represented the sum of all
phrasal or idiomatic entries associated with each word.
As an example of how this coding system worked, here is how the word
desert was analysed. The entries for desert in the American Heritage
School Dictionary were as follows:

     desert(1)  n. A dry, barren region, often covered with
                sand, and having little or no vegetation
                adj. Uninhabited: a desert island

     desert(2)  v. 1. To forsake or leave: abandon
                2. To leave (the army or an army post) illegally
                and with no intention of returning

     desert(3)  n. Often deserts. That which is deserved or merited

There is a total of five distinct meanings listed in these
definitions; thus, the number of distinct meanings at level SEM 0 would be
five. At level SEM 1, the two meanings of the verb (desert(2)) would be
grouped together, since most contexts should make the military implications
of the word desert fairly obvious. At level SEM 2, these four remaining
meanings would still be distinct, but at level SEM 3, where any clearly
related meanings are grouped together, the adjective meaning of desert(1)
would be grouped with the meanings of desert(2). At the level SEM 4, the
meaning of desert(1) (the noun) would be grouped together with these,
leaving only two distinct meanings. This word would still be counted as
three homographs, based on the numbering system of the American Heritage
School Dictionary.
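
The walk-through for desert can be expressed as a small grouping procedure: at each level, any two meanings whose semantic distance is at or below that level are collapsed, and the remaining groups are counted. The sketch below is our illustration only. The pairwise distances are read off the description above, the assignment of the SEM 4 link to the noun and the first verb sense is an assumption, and so is the treatment of grouping as transitive (a meaning joins a group if it is close enough to any member).

```python
from itertools import combinations

meanings = [
    "n: dry region",        # desert(1), noun
    "adj: uninhabited",     # desert(1), adjective
    "v: forsake",           # desert(2), sense 1
    "v: leave the army",    # desert(2), sense 2
    "n: what is deserved",  # desert(3)
]

# Pairwise SEM distances; any pair not listed is treated as unrelated.
distance = {
    frozenset({"v: forsake", "v: leave the army"}): 1,
    frozenset({"adj: uninhabited", "v: forsake"}): 3,
    frozenset({"n: dry region", "v: forsake"}): 4,
}
UNRELATED = 5

def distinct_meanings(level):
    """Count groups of meanings after collapsing pairs related at <= level."""
    parent = {m: m for m in meanings}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in combinations(meanings, 2):
        if distance.get(frozenset({a, b}), UNRELATED) <= level:
            parent[find(a)] = find(b)
    return len({find(m) for m in meanings})

counts = {level: distinct_meanings(level) for level in range(5)}
print(counts)  # {0: 5, 1: 4, 2: 4, 3: 3, 4: 2}
```

The counts at each level match the sequence described in the text: five meanings at SEM 0, four at SEM 1 and SEM 2, three at SEM 3, and two at SEM 4.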

One might argue that the noun meaning of desert(1) should have been
grouped with the verb meanings at level SEM 3 instead of SEM 4, since the
relationship between the two is fairly clear. On the other hand, perhaps
due to the difference in pronunciation, we would guess that most
individuals do not make a conscious connection between the two meanings.

Ultimately, such decisions would have to be based on empirical
studies. On the other hand, while our current coding system is subjective,
Dupuy's (1974) criteria for whether or not a word is redundant are not
inherently any more objective than ours. Our criteria have the advantages
of making finer distinctions, that is, recognizing degrees of semantic
transparency, and being at least in principle defined in terms of the
difficulty a word might present to children encountering it for the first
time in reading. In addition, the two end points of our scale of the
number of meanings for a word (SEM 0 and the number of homographs) are
operationally defined.

The results of this analysis are presented in Table 8. For each
measure of polysemy (the five levels of semantic distinctness, the number
of homographs, and the number of phrasal and idiomatic entries), two
measures are given.

Insert Table 8 about here.

The first is the mean number of meanings; that is, the total number of
distinct meanings divided by the number of morphologically basic words. We
can assume that our sample of 156 morphologically basic words is
representative of the morphologically basic words in the WFB. The
frequency distribution of morphologically basic words in the population,
however, is different from that in the WFB. For levels SEM 2 and SEM 3,
estimates are given for the population as well, taking into account that
the population will have a higher proportion of words with lower
frequencies and fewer meanings. (Estimates are given for levels SEM 2 and
SEM 3 because these levels are most likely to reflect the knowledge of
relatedness among word meanings in grades 3 through 9. In our opinion,
SEM 3 should give a very conservative estimate, and probably an
underestimate, of the number of meanings that would be functionally
distinct at this level.)

The second figure is the total number of distinct meanings among the
morphologically basic words. Estimates are given for the WFB, and, for
levels SEM 2 and SEM 3, for the underlying population as well. There are
an estimated 10,108 morphologically basic words in the WFB. At level SEM
2, there are about 2.038 distinct meanings per morphologically basic word,
and hence a total of 20,600 distinct meanings of morphologically basic
words. For the population of morphologically basic words in printed school
English, there would be approximately 73,417 distinct meanings. These
figures are lower for level SEM 3, since fewer meanings are counted as
distinct at this level.
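
The WFB total at level SEM 2 follows directly from the mean and the word count. A brief sketch of that step, using only the two figures given above (the variable names are ours):

```python
basic_words_in_wfb = 10_108   # estimated morphologically basic words in the WFB
mean_meanings_sem2 = 2.038    # mean distinct meanings per word at level SEM 2

total_meanings_sem2 = basic_words_in_wfb * mean_meanings_sem2
print(round(total_meanings_sem2))  # 20600
```

The population figure cannot be reproduced the same way from this passage, since it rests on a separate adjustment for the lower frequencies and fewer meanings of words outside the WFB.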

A count of all semantically distinct vocabulary items will have to
include not only all meanings of morphologically basic words, but also
meanings of semantically opaque derived words. (Numbers for these are
taken from Table 6, which gives a more conservative estimate of the number
of semantically opaque forms, assuming, so to speak, that the individual
already knows all the meanings of the base forms.) This measure can be
added to the number of distinct meanings among the morphologically basic
words to give an estimate of the total number of distinct meanings in the
vocabulary (for any given criterion for semantic distinctness).

Table 9 gives the total number of distinct meanings at two levels of
semantic relatedness. At level SEM 2, the total number of distinct
meanings in printed school English is estimated at 105,238. At level SEM
3, the total is 67,417.



Insert Table 9 about here. 

Compound Entries. Dupuy (1974) and the Word Frequency Book both
exclude compound entries, that is, those which consist of two or more words
separated by spaces. Approaching the issue of vocabulary size from the
perspective of learning new items, it would seem more appropriate to
exclude those (and only those) compound entries whose meanings were
computable on the basis of the meanings of their parts, so that a child
encountering this combination for the first time in the process of reading
could, with a little help from context, infer its meaning.

A survey of the 698 compound entries excluded by Dupuy indicates that
a substantial number of them have meanings which are not totally
predictable from the meanings of their parts. First of all, there are
idioms such as bum steer, favorite son, one-night stand, or straw man.
There are about 77 such items among the 698 excluded by Dupuy which have
meanings obscure enough that a child would almost undoubtedly have to learn
them as separate items.

There are at least 134 additional items which are semantically opaque
in the following sense: It is clear that a snake fly is a kind of fly, or
that a snap bean is a kind of bean. But the word snake does not really
tell what kind of fly a snake fly is; nor does the word snap give enough
information, on the basis of its literal meaning, to distinguish snap beans
from other beans. The actual reference of such terms must be learned
individually for each such item. Altogether, then, there are 211 items
among the 698 compound terms excluded by Dupuy which are idiomatic in that
their exact meaning is not predictable from the meanings of their component
parts.

Since Dupuy's analysis is based on a 1% sample of Webster's Third,
this means that there are approximately 21,100 semantically opaque compound
items in that dictionary. Considering that the vocabulary of printed
school English has been found to be comparable to that in an unabridged
dictionary in other respects, we would expect somewhere near this number of
semantically opaque compound items to be found in school texts as well.
Much of this number, however, has already been incorporated into our
measures of polysemy, since our count of the number of distinct meanings
included all phrasal and idiomatic entries related to any morphologically
basic word. From the number of semantically opaque compound entries in
Webster's Third, however, we can be fairly sure that our estimate of the
contribution of polysemy to the size of vocabulary is a conservative one.
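
The projection from Dupuy's sample to the whole dictionary is a simple scaling step. A minimal sketch, using only the counts quoted above (the variable names are illustrative):

```python
# Counts among the 698 compound entries excluded by Dupuy.
idioms = 77               # e.g. "straw man": meaning obscure from the parts
opaque_kind_terms = 134   # e.g. "snap bean": kind is clear, reference is not

# Dupuy's analysis is a 1% sample of Webster's Third, so each sampled
# item stands for about 100 dictionary entries.
sample_percent = 1

opaque_in_sample = idioms + opaque_kind_terms
projected_total = opaque_in_sample * 100 // sample_percent

print(opaque_in_sample)   # 211
print(projected_total)    # 21100
```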
Total Count of Nonredundant Items

Given an estimate of at least 1,000 proper names that should be
counted as part of general vocabulary knowledge, and 4,000 abbreviations,
irregular inflections, and other orthographically nonredundant words, added
to the figures already calculated for incorporating polysemy, we come up
with an estimate of 110,000 distinct words in printed school English. This
number assumes that individuals are only able to utilize SEM 0, SEM 1 and
SEM 2 relationships in learning or interpreting new words. For someone who
is able to utilize SEM 3 relationships as well, the number of distinct
words would be 72,000.
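
The two totals can be reproduced by adding the proper names and other nonredundant items to the Table 9 totals quoted earlier and rounding to the nearest thousand. A minimal sketch (variable names are illustrative):

```python
# Totals of distinct meanings from Table 9, as quoted in the text.
meanings_sem2 = 105_238
meanings_sem3 = 67_417

# Additional nonredundant items, as estimated in the text.
proper_names = 1_000
other_nonredundant = 4_000   # abbreviations, irregular inflections, etc.

extra = proper_names + other_nonredundant
total_sem2 = meanings_sem2 + extra
total_sem3 = meanings_sem3 + extra

# Rounded to the nearest thousand, these match the figures in the text.
print(round(total_sem2, -3))  # 110000
print(round(total_sem3, -3))  # 72000
```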

The Distribution of Words by Frequency 

So far, we have shown that printed school English includes a very
large number of words, comparable to the number of words in a fairly large
unabridged dictionary. Now we would like to determine, as far as is
possible, how many of these words a student in grades three through nine
might actually encounter in reading, and how many of these words would
actually be useful to a student.

One way to approach this question is to look at the frequencies of the
words. Table 10 shows how the words in printed school English are
distributed by frequency. Frequencies are given in terms of U, or
estimated frequency per million words of text. A word with U = 10.0, for
example, would be expected to occur on the average about ten times in a
million words of text. Details of how U is calculated are found in the WFB
(p. xl).
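
Because U is a rate per million words, the expected number of occurrences in any amount of text is a direct proportion. A minimal sketch; the function name and signature are hypothetical, and the actual estimation of U from the corpus (described in the WFB) is not reproduced here:

```python
def expected_occurrences(u: float, running_words: int) -> float:
    """Expected occurrences of a word with frequency U (per million words)
    in a text of the given length, assuming U is a simple per-million rate."""
    return u * running_words / 1_000_000


# The example from the text: U = 10.0 in a million words of text.
print(expected_occurrences(10.0, 1_000_000))  # 10.0
```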

The numbers of graphically distinct types with a frequency equal to or
greater than a given value are interpolated from tables in the WFB. These
numbers are predicted on the basis of the lognormal model; according to
this model, if frequencies are expressed logarithmically, words will be
found to occur in a normal distribution along the frequency continuum.



Insert Table 10 about here. 



The number of morphologically basic words and semantically opaque
derivatives (included here are SEM 3, SEM 4 and SEM 5 derived forms) gives
us an approximate idea of the number of distinct word families among the
words above any given frequency level. It should be cautioned that the
number of distinct word families at any given level is underestimated
somewhat, since the most frequent member of a word family is sometimes a
regular inflection or transparent derived form. The word month, for
example, has a U of 71.635, whereas the U of the plural months is 115.15.
Thus, the word family containing month and months is not included in the
count of 555 morphologically basic words and semantically opaque
derivatives that have a U of 100.0 or greater. However, among the words in
that frequency range, one does encounter a representative of the month
family, so that more than 555 word families are actually represented.

Semantically transparent derivatives include those derived words
(suffixed, prefixed and compound forms, and a few idiosyncratic forms like
prophesy), the meanings of which are largely or wholly predictable from the
meanings of their component parts (i.e., SEM 0, SEM 1 and SEM 2).

At least two things are clear about the distribution of words by
frequency. First of all, most words are in the lower ranges of the
frequency spectrum. About half the words in printed school English, no
matter how one counts them, occur roughly once in a billion words of text
or less. Second, semantically transparent derivatives are skewed towards
the low end of the frequency distribution to a greater degree than are
morphologically basic words and semantically opaque derivatives. The
relative proportion of these two categories changes radically from one end
of the distribution to the other; although there are substantially more
transparent derivatives than there are morphologically basic words and
semantically opaque derivatives, among the most frequent words the
semantically transparent derivatives are relatively rare.

This difference in distributions has some distinct implications for
instruction. If a child were exposed only to vocabulary controlled
carefully by frequency, there would be both relatively little opportunity
to learn, and little necessity to make use of, the word-formation processes
that relate derived words to their component parts. The relatively few
transparent derived words that do occur in the higher frequency ranges are
likely to be learned, at least at first, as unanalyzed wholes (cf. Kuczaj,
1977; Silvestri & Silvestri, 1977). On the other hand, it is clear that as
one's exposure to the language expands into the lower frequency ranges,
knowledge of word-formation processes becomes an increasingly necessary
skill.

At this point it might be appropriate to comment on the importance of
low frequency words. One might be tempted to argue, after all, that words
occurring once in a million words of text or less, however many such words
there may be, are really not worth much consideration. If the student
encounters such words on the average once a year or less (for any
individual word) there wouldn't seem to be a need to include them in any
program of vocabulary instruction.

But before jumping to any conclusions about words in the lower ranges
of the frequency continuum, it might be useful to look at what words are
actually involved. Many of them do seem to be of little general use, but
there are some rather useful-seeming words there as well. Among the words
occurring less than once in 100 million words of text (U = 0.008) are ones
such as:

    amnesty       elevate   gnome        persecute
    appall        evict     hornswoggle  racoon
    assimilate    expound   ignoramus    rambunctious
    busybody      flex      jellybean    rote
    cheeseburger  fluent    liturgy      shamrock
    contemporary  fume      mediate      stenographer
    eczema        furor     papaya       syncopate

Among the even rarer words, occurring less than three times in a billion
words of text (U = 0.0025) are:

    ammeter       anneal    billfold     cloverleaf
    cyanide       deform    hex          orthographic
    solenoid      template  unwieldy     ventilate
    calliope      emanate   extinguish   flippant
    nettle        pidgin    saturate     seagull
    spinnaker     fresco    inflate      sacrament

This is not a representative sample of low-frequency words, to be sure, but
these examples do demonstrate that just because a word has a relatively low
frequency in printed school English does not mean that it is of little
utility.

Since a word's frequency does correlate with the probability that an
individual will know that word, it is easy to mistakenly identify low
frequency with difficulty. But almost any book by Dr. Seuss will serve as
proof that utterly novel words are not necessarily difficult for a child to
read. Yet many such words occur only once in a single story, and thus
would have astronomically low frequencies in any large scale survey of word
frequency.

The frequency of a word reflects a number of factors; one of them is 
often the conceptual difficulty of the word. But in general it might be 
said that a word's frequency reflects the range of contexts in which the 
word might appear. A "rare" word such as sacrament is important within a 
certain set of contexts, but this set of contexts is very small compared to 
the universe of contexts that are covered in printed school English. 




It should also be noted that frequency studies such as the WFB that
involve very large samples of written language are not representative of an
individual student's exposure to the language. Because choice of words
will be more consistent within a given author's works or a given subject
category, any individual student will not get a random sample of vocabulary
containing a wide range of low frequency words occurring once each.
Rather, in a given student's reading, most low frequency words will not
occur at all, and of those that do, many may occur a number of times.

There is an important sense in which the frequencies listed in the WFB
underestimate the true frequency of occurrence for a given word. A
student's exposure to the word drive, for example, is not a function of the
frequency of that graphically distinct type alone, but rather, a function
of the sum of the frequencies of all members of the family. In this case,
one would certainly want to include forms such as Drive, driven, driver,
Driver, driver's, Driver's, drivers, drivers', drives, and drove. The
frequency of this entire family is over three times greater than the
frequency of the morphologically basic word drive. This particular family
is more extensive than many, but it is still true that family frequency is
always greater than or equal to the frequency of any individual member. In
this sense, students may encounter some of the low-frequency words in
printed school English more often than one would gather from the
frequencies reported in the WFB.
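
Family frequency, as described here, is just the sum of the U values of the family's members. The individual frequencies for the drive family are not listed in the text, so the sketch below reuses the month and months values quoted earlier in this section; a fuller family would include further members, which are omitted because their frequencies are not given.

```python
import math

# U values (per million words) quoted earlier for two members of one family.
family = {"month": 71.635, "months": 115.15}

# Family frequency is the sum of the frequencies of all members.
family_u = sum(family.values())

# It can never be smaller than the frequency of any single member.
assert all(family_u >= u for u in family.values())

print(round(family_u, 3))
```

On these two members alone, the family's combined frequency is above the U = 100.0 threshold even though the base word month falls below it.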

Finally, it should be noted that the materials on which the WFB is
based tend to have a higher proportion of high frequency words than does
printed matter written for adults. This means that the frequencies
reported for rare words in the WFB will in general be lower than the
reported frequencies for the same words in adult materials.

The distribution of words by frequency does show that of the many
words in the vocabulary of printed school English, a large portion have
very low frequencies. Nevertheless, one must be careful in interpreting
this fact. It would be a mistake to suppose, for example, that all words
occurring once in a million words of text were so technical or specialized
as to be of no pedagogical significance.

How Many Different Words Do Children Actually Encounter?

To get an accurate picture of the vocabulary that students actually
encounter in printed school materials will require both information on the
amount and type of reading done by children in and out of school, and a
reanalysis of our data by grade level. Our plans for future research
include both these steps; at present, however, we can get at least an
approximate idea of the number of words students have to deal with in
school reading. At the lower end of the spectrum, one might imagine a less
able reader at one of the lower grade levels reading as few as ten pages a
day from books with large print and frequent pictures, averaging 100 words
per page. If this rate were maintained through 100 days of the school
year, 100,000 running words of text would be covered. This figure would
seem to be a lower limit to the amount of reading done in school between
grades three and nine. On the other hand, it does not seem unlikely that
an average reader in seventh grade might spend fifty minutes a school day
in actual reading, at a rate of 100 to 200 words per minute. In 100 school
days, 500,000 to 1,000,000 running words of text would be covered. This is
certainly not a maximum; given a higher reading speed, a little more time
spent in reading, and more consistent reading during the year, a child
might cover 10,000,000 running words.
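
The reading-volume estimates above are products of a daily amount and a number of school days. A minimal sketch that reproduces them (the names are illustrative; the inputs are the figures assumed in the text):

```python
SCHOOL_DAYS = 100  # days of reading assumed in the text

# Less able reader in a lower grade: 10 pages a day at 100 words per page.
lower_limit = 10 * 100 * SCHOOL_DAYS

# Average seventh grader: 50 minutes a day at 100 to 200 words per minute.
average_low = 50 * 100 * SCHOOL_DAYS
average_high = 50 * 200 * SCHOOL_DAYS

print(lower_limit)    # 100000
print(average_low)    # 500000
print(average_high)   # 1000000
```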

The foregoing estimates may be conservative. Carroll (1964) has
conjectured that college students may be exposed to as many as a million
running words a week in their reading, lectures, and conversations. Our
own conjecture is that there are avid readers from the middle grades who
approach this figure.

From the WFB (see Table B-9, p. xxxvii) it appears that a student in
grades three through nine who reads 500,000 to 1,000,000 running words of
text in a year will be exposed to between 20,000 and 40,000 graphically
distinct types. From our analyses of the WFB this would mean that
somewhere between 4,000 and 10,000 distinct word families might be
encountered. More precise estimates will require analysis of our data by
individual grade levels. In the meantime, we can be fairly confident that
an average reader in the upper half of the grade range would encounter at
least 5,000 distinct word families in a year, perhaps as many as 10,000.
At least 1,000 of these would be families that had not been encountered in
the previous year, and it is quite possible that an active reader in these
grades could come across three or four thousand totally new vocabulary
items in the course of a school year.

Further analyses will allow us to specify with much more precision the
number of new word families that a child in any grade would be likely to
encounter. However, even the present rough estimates are sufficient to
demonstrate that direct instruction could not cover more than a small
fraction of the words that a student will actually encounter in school
reading.

Word Families in School English

How much interrelatedness is there among words in printed school
English? One way to approach this question is in terms of the size of the
average word family. If there are 609,606 graphically distinct types
in printed school English, and only 88,533 distinct word families, one
would expect there to be 6.88 members per family. This figure is
inaccurate, however, because there are several kinds of words (e.g.,
numbers and proper names) which were not included in any family at all.
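
The naive family size is the ratio of types to families. A minimal sketch of that division (variable names are illustrative); the quotient is about 6.886, which the text reports as 6.88:

```python
graphically_distinct_types = 609_606
distinct_word_families = 88_533

# Naive average: every type is assumed to belong to some family, which the
# text notes is not true of numbers, proper names, and similar items.
members_per_family = graphically_distinct_types / distinct_word_families

print(f"{members_per_family:.3f}")
```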

Table 11 represents the average composition of a word family in
printed school English. Since the concept "word family" can be defined
only with respect to some level of morphological ability, we have decided
to give figures based on two different definitions.



Insert Table 11 about here.



Definition A adopts a conservative estimate of the number of distinct
word families in printed school English. Assuming, in this case, that some
individuals might make effective use of even SEM 3 and SEM 4 relatedness in
learning derived words, we count as distinct word families only
morphologically basic words and derived words with a semantic relatedness
level of SEM 5. By this definition there are about 54,000 distinct word
families. Since people frequently learn words without perceiving
relationships that do exist between them (e.g., basement and base) we would
consider this to be an underestimate of the true number of distinct word
families; however, it can serve as a reasonable lower limit.

Definition B is the definition of word family we have adopted up to
now; it includes morphologically basic words and derivatives at levels SEM
3, SEM 4 and SEM 5. By this definition there are around 88,500 distinct word
families. This is by no means an upper limit; as discussed above, the
number could be raised considerably if, for example, distinct meanings were
counted as separate word families, or if even a small portion of proper
names were included. But given that we want a figure comparable to
Definition A in excluding proper names and not considering problems of
polysemy, this can be taken as our best estimate of the number of distinct
word families, for children who can make use of English derivational
morphology when the semantic gap between derived word and base is
relatively small.

Table 11 shows that for each word known most people will readily
interpret .87 to 1.42 words that differ only in minor details of form, and
from 1.16 to 1.90 words which are inflections of the base word. It can
also be seen that in the average word family, for each base word, there are
between 1.57 and 2.57 additional semantically transparent derivatives. For
the child who is able to make use of SEM 3 and SEM 4 derivatives, for each
word learned there are more than three derived words with meanings
recognizably related to that of the base, and at least two of these
involving fairly transparent relationships. This demonstrates that the
ability to utilize morphological relatedness among words puts a student at
a distinct advantage in dealing with unfamiliar words.

Summary and Implications

Measures of Absolute Vocabulary Size 

Our basic finding has been that when a psycholinguistically and
pedagogically justifiable way of counting words is employed the number of
words in printed school English is extremely large. Furthermore, our
findings imply that previous low estimates of individual vocabulary sizes
are in error. Specifically, Dupuy (1974) substantially underestimated
vocabulary size because he underestimated the number of basic words in
English.

Dupuy (1974) calculated the number of basic words in English for the
purpose of creating a vocabulary test that would indicate an individual's
total vocabulary size. This test, the Basic Word Vocabulary Test, is
advertised as "the only test on the market that yields an estimate of a
student's total vocabulary size, which is important for reading and general
educational development" (Jamestown Publishers, Catalog for 1982).

As is stated in the examiner's manual, the estimation of vocabulary
size based on this test does not represent the total number of words an
individual knows, but rather the total number of Basic Words, as they have
been defined in Dupuy (1974). Dupuy did succeed in giving an explicit,
operational definition to the construct "Basic Word." It is very
questionable, however, whether this construct can be given the
interpretation that the name "Basic Word" suggests. Our results indicate
that Dupuy's estimate of 12,300 basic words in English is a gross
underestimate of the number of distinct vocabulary items in the language.
Our figure of 88,533 distinct word families is larger than Dupuy's by a
factor of seven. If we define total number of words in terms of items that
must be learned individually (counting homographs and other distinct
meanings, abbreviations, etc., as separate words), the number of words in
printed school English may be as high as 110,000. Thus, the true
vocabulary size of an individual could be more than seven times greater
than what is indicated by his or her performance on the Basic Word
Vocabulary Test.
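
The "factor of seven" is the ratio of the two counts. A minimal sketch of the comparison, using only figures quoted above (variable names are illustrative):

```python
dupuy_basic_words = 12_300        # Dupuy's (1974) estimate
word_families = 88_533            # distinct word families, this study
individually_learned = 110_000    # upper figure, counting distinct meanings etc.

ratio_families = word_families / dupuy_basic_words
ratio_upper = individually_learned / dupuy_basic_words

print(f"{ratio_families:.1f}")  # a little over 7
print(f"{ratio_upper:.1f}")     # closer to 9
```

As the following paragraph cautions, these ratios compare whole-language totals and do not license multiplying an individual's test score by the same factor.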

Of course, it is not possible to get an accurate revised measure of
vocabulary size simply by multiplying scores on Dupuy's test by seven. The
items in the test, although they may be a representative sample of Basic
Words as defined in Dupuy, do not necessarily constitute a representative
sample of basic words in any other sense. In addition, while our estimate
of the total number of distinct words in English is seven times greater
than Dupuy's, a quite different relationship may hold between specific
subsets of these words. For example, the number of items among
distinct word families that a third grader would be likely to know may not
be seven times as great as the number of Dupuy's Basic Words that fall into
this same category. Still, it is possible to conclude that the Basic
Word Vocabulary Test underestimates vocabulary size by an order of
magnitude.

Programs of Vocabulary Instruction 

Our results indicate that the number of words that students encounter
in reading is very large, and the results strongly suggest that children's
vocabularies are larger than some recent investigators have supposed.
Advocates of direct vocabulary instruction have leaned heavily on the
assumption that the number of distinct words in school English is small,
that unaided year to year growth in vocabulary is modest, and that the
total number of word meanings known by a typical child at any age is not
large. Notably, Becker, Dixon and Anderson-Inman (1980), accepting Dupuy's
estimate that the average high school senior knows approximately 7,800
words, have attempted to lay out a program of systematic instruction for a
core vocabulary of 8,000 words.

Our findings suggest that high school students may actually know far
more words, perhaps somewhere between 25,000 and 50,000, or even more.
Dupuy (1974) estimates that third graders know only 2,000 words, but
estimates by others are higher. Cuff (1930) places third-grade
vocabularies at around 7,425 words, and M. K. Smith (1941), using
vocabulary tests based on Seashore and Eckerson (1940), set the figure at
25,000 basic words. It is quite possible, then, that the average third
grader already knows 8,000 words.

A program of systematic instruction for a core vocabulary of 8,000
words is not necessarily a bad idea. As Table 10 shows, if 8,000 words
were correctly chosen, they could cover all distinct word families found
among words that occur at least once in a million words of text. But the
theoretical foundation of this program (taking Dupuy's Basic Words as a
benchmark for the number of items to be learned) is questionable.

There is reason to worry that Becker, Dixon, and Anderson-Inman did not
find the right set of 8,000 words, and, furthermore, that they made
unreasonable assumptions about semantic relatedness. They culled their set
of 8,000 words from a list of 26,000 based on the Thorndike and Lorge
(1944) Teacher's Word Book of 30,000 words, with some adjustments to bring
the list up to date. The list of 26,000 "object words" was collapsed to
8,000 "root words," where a root word was defined as "the smallest word
from which the other words can be semantically derived. In designating a
root word for any given object word a search was made for the smallest word
within the object word that contains the core meaning of the object word"
(emphasis in the original). The assignment of root words was frequently
the same as in the present analysis; for example, the root word of helpless
was help. However, in our judgement, Becker and his associates often
stretched the criterion of semantic (and morphological) relatedness beyond
reason. For example, all of the following words were assigned the root
word judge on the basis of their semantic relatedness: jural, juridical,
jurisdiction, jurisprudence, jury, judicious, judicature, prejudice,
prejudicial, unprejudiced, judicial, judiciary, judge, and judgement.

The problem with this grouping is the assumption that direct
instruction on the root words and on affixes would automatically result in
a child knowing the meanings of the whole set of words. Becker, Dixon, and
Anderson-Inman (1980, p. 7) admit that "providing systematic instruction for
even 8,000 root words is a monumental undertaking." We consider it even
more monumental for a student, having been taught only the meaning of
judge, to be able to identify what words were in fact related to it, and
then to figure out their meanings. How could a child, encountering words
such as Judaic, judicious, judo, juggernaut, juggle, jugular, Julian,
junta, and jury for the first time in text, know which were historically
related to judge? Furthermore, the most important part of the meaning of a
word such as jury is not what it has in common with the root word judge
(this much of its meaning would probably be pretty obvious from the
context), so much as how it differs from it. Furthermore, since the root
words were usually chosen to be one of the more frequent members of a set
of related words, it may well be that children already know many or most of
the 8,000 root words, and that it is the "derived" words such as judicial,
jury, and judiciary, rather than root words like judge, for which they
really need instruction.

Of course, many of the derived words were in fact transparently
related to their root words. But because no distinctions were made among
different degrees of relatedness or different types of relatedness, Becker
and his colleagues underestimate the number of words that are functionally
distinct as far as vocabulary learning is concerned.

Beck, McCaslin, and McKeown (1980) have formulated an intensive program
of vocabulary instruction which has as a major aim increasing students'
reading comprehension. One motivation for their program was that several
previous experimental studies have failed to produce significant increases
in reading comprehension via vocabulary instruction (e.g., Jenkins, Pany, &
Schreck, 1978). Beck and her associates hypothesize that vocabulary
instruction can facilitate reading comprehension only if the words are
learned thoroughly, to the point where the word's meaning can be accessed
quickly or automatically, and where a fairly rich network of semantic
connections between that word and others has been developed. Because of
this, their program involved repeated exposure to words. Children in their
study were exposed to each word 10-18 times in a variety of tasks. There
was also a subset of words in their study which were repeated 26-40 times,
to see if the additional repetition would result in even greater learning.

Results from an application of this program in a fourth grade 
classroom are described in detail in Beck, Perfetti and McKeown (in press). 
Even with the intensive instruction and repetition, children learned 77.6% 
of the words that were repeated 10-18 times, and 86.5% of the words 
repeated 26-40 times. So it does not appear that the program was 
unnecessarily repetitive.

How much ground did the program cover? Just 104 words were taught
over a five month period, with one half hour per day devoted exclusively to
this vocabulary program. At this rate, 208 words could be covered in a
school year. If the program were streamlined by having all words repeated
only 10-18 times (that is, dropping the extra repetition of the special
subset of words), one might be able to cover a little over 400 words per
year. Note that Becker, Dixon, and Anderson-Inman's program to cover 8,000
words in 10 years would have to progress at twice this rate, either by
spending more total time on vocabulary, or less time on each word.
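
The pacing comparison can be laid out explicitly. A minimal sketch; the ten-month school year is an assumption implied by the text's doubling of 104 words to 208, and the streamlined rate of 400 words is the text's own estimate rather than something derived here:

```python
# Beck, McCaslin and McKeown's program, as described in the text.
words_taught = 104
program_months = 5
school_year_months = 10   # assumption: implied by 104 -> 208 in the text

words_per_year = words_taught * school_year_months // program_months

# The text's estimate for a streamlined version (10-18 repetitions only).
streamlined_per_year = 400

# Becker, Dixon and Anderson-Inman: 8,000 words over 10 years.
becker_per_year = 8_000 // 10

print(words_per_year)                          # 208
print(becker_per_year / streamlined_per_year)  # 2.0
```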

How does this compare with the amount of vocabulary that students
encounter in school? According to our rough estimates, a child might
easily come across a thousand or more totally new word families each year
in his or her reading; for an active reader in the upper grades, the figure
would certainly be higher. Thus, the program of vocabulary instruction
suggested by Beck and her associates could not hope to cover half of the
new words children actually encounter in their school reading. And the
total number of words covered by such a program in ten years of school (at
most around 5,000 words) would apparently constitute only a small fraction
of the reading vocabulary of a fairly good reader.

According to Beck, McCaslin and McKeown (1980, p. 8) it takes "an
extended series of fairly intensive exposures [to a word] ... before it can
be quickly accessed and applied in appropriate contexts." It may well be,
of course, that automaticity of access is the key factor in the
relationship of word knowledge to reading comprehension; but the puzzle
that must be solved by those who propose to produce automaticity using word
drills is how to do it in the available time, not just for four or five
thousand words, but thousands or even tens of thousands of less frequent
ones.

The schools have never had programs of vocabulary instruction as
extensive as that proposed by Becker or as intensive as that proposed by
Beck. The question that naturally arises is, up to now, how have readers
acquired their vocabulary knowledge? Our answer to this question appears
in the final section of this paper.

Generalization to Non-Instructed Words 

A basic implication of our study is that, because of the sheer volume 
of vocabulary that students will encounter in reading, any approach to
vocabulary instruction must include some methods or activities that will
increase children's ability to learn words on their own. Any attempt to do
this would be based on one or more of three possible emphases: motivation,
inferring word meanings from word parts (morphology), and inferring word 
meanings from context. 

There is basically no experimental literature that could confirm the 
success of any of these in facilitating children's learning of words on 
their own. We can at least speculate, though, on the implications of our 
findings as to the effectiveness of such approaches. 

With respect to motivation, it is no doubt an important factor. For
all we know, it may be as important as any other aspect of vocabulary
instruction. To quote from Petty, Herold and Stoll (1968),

[M]any researchers considering vocabulary development pass over
motivation without mention. No classroom teacher genuinely attempting
to teach vocabulary makes that mistake. ... [T]eachers reporting on
favorite techniques begin with discussions of how student interest in
word study was created (p. 19).

Beck's program does include a strong motivational component. For
instance, some of the learning activities took the form of competitive
games, and there were incentives for children to report instances of
instructed words they found outside the classroom. Attention to
motivational factors did seem to contribute to the overall success of the
instruction. Beck and her colleagues feel it may be a reason for the
apparent increase in the experimental children's performance on tests of
words not covered in the instruction. However, further research will be
necessary to determine whether this effect was really a generalized
increase in word learning, the result of improved vocabulary test taking
skills, or an artifact of experimental design.

Morphology and Vocabulary Instruction 

Our findings suggest an important role of morphology in the learning 
of vocabulary. Semantically transparent derived words are relatively rare 
among the most frequent words, but constitute an increasingly greater
proportion of the vocabulary as one goes towards the lower end of the 
frequency continuum. 

For this reason, frequency cannot be the only criterion by which words 
are chosen to be included in vocabulary instruction. If the students only 
encountered words of fairly high frequency, there would be little 
opportunity to learn the productive word-formation processes in English 
that constitute the key to understanding the bulk of lower-frequency words.

The introduction of new words should be determined by family
relationships as well as by frequency. For example, drama and dramatic are
fairly frequent words (with Us of 11 and 18, respectively), but the
derivative forms are fairly rare in printed texts, e.g., dramatist (U =
.02), dramatize (U = .40), and dramatization (U = .50). Teaching words
together as a family has a number of advantages. First, if the most
frequent words in the family are already known, this procedure builds a
bridge from familiar to new. In any case, once the meanings of drama were
instructed, the meanings of the derivatives could be covered with little
additional effort. What additional time is devoted to the derivatives
would also function to reinforce the learning of the base word.

Another benefit of teaching words in families would be to call the 
students' attention to the word-formation processes that relate the 
different members of the family, so that they would be more likely to take
advantage of such relationships on their own. In addition, covering a
family of words would familiarize students with the types of changes in
meaning that often occur between related words, thus preparing them to deal
with cases in which the semantic relationships among morphologically 
related words are not so transparent. 

It should be remembered, however, that our definition of word family 
is based on relationships among existing words in English, not on 
historical roots, and on semantic relationships that are transparent enough 
for students to perceive on their own. We remain highly skeptical of
approaches to vocabulary that take an etymological or historical
approach to word meanings, approaches which feign that words such as
dialect, collect, and intellect have some basic meaning in common. There
may be some perceptual or mnemonic value to analysing words into
historically-based components, but this remains to be established.
Shepherd (1974) found that knowledge of Latin roots (e.g., -ceive, -lect) is
not strongly related to the knowledge of the meanings of words containing
such roots (e.g., receive, collect), whereas knowledge of stems which
themselves are English words (e.g., sane) is strongly related to knowledge
of the meanings of related derived forms (e.g., sanity). The type of
relatedness among words analysed in the present study, along with its
associated implications for instruction, is not to be confused with the
etymological or historical approach adopted by some.

Learning Word Meanings from Context

That word meanings are learned from context is an inescapable fact.
Many ninth graders, even more high school seniors, and almost all educated
adults would be able to read any school materials for grades three through
nine with a high level of comprehension.
This presumably requires knowing a large proportion of 88,500 distinct word
families. These words could not be acquired from direct instruction or
from looking them up in a dictionary. There is only one other possible
source of knowledge: inference based on context. Thus, logic forces the
conclusion that successful readers must learn large numbers of words from
context, in most cases on the basis of only a few encounters.

It is hard to conceive how a word such as if, for example, could be
learned in any other way than from verbal context. Pointing to something
in the world that corresponds to the concept of hypotheticality would be
difficult to say the least, and any child old enough to understand a
non-circular definition of if is surely already able to use the word fluently.

Good readers may acquire large vocabularies exactly because they are
better at inferring word meanings from context. One indication of this is
the fact that a cloze test is a satisfactory measure of reading ability.
While a cloze test is taken as indicating overall reading ability, the
skill it measures most directly is the ability to use contextual
information to supply the meanings of words missing from text, a task
analogous to that of identifying the meanings of unknown words in context.

Knowledge of morphological relatedness among words probably contributes
importantly to learning word meanings from context. Our findings here
show that a large number of infrequent words are transparent derivatives of
other words, in many cases of words the student is likely to know already.
While context often is not sufficient to determine the meaning of an
unfamiliar word, it may provide enough information to permit a guess at the
appropriate meaning of a word whose semantic content is partially
determined by its morphology. A child who knows the meaning of drama and
the function of the suffix -ist will need only minimal help from context to
determine the meaning of dramatist. A hypothesis that should be explored
in future research is that joint utilization of contextual and
morphological information is a strategy employed by children who develop
large vocabularies.

We hypothesize that the principal engine driving vocabulary growth is
volume of experience with language. Oral language experience is important,
of course, particularly for the young child, but we judge that beginning in
about the third grade the major determinant is amount of free reading. It
is a surprising fact that there are no satisfactory estimates of the number
of words read per year by children of different ages. Earlier we guessed
that the least able and motivated children in the middle grades might read
100,000 words a year while average children at this level might read
1,000,000. The figure for the voracious middle grade reader might be
10,000,000 or even as high as 50,000,000. If these guesses are anywhere
near the mark, there are staggering individual differences in volume of
language experience, and, therefore, opportunity to learn new words.
Notice also that variation of this magnitude could readily explain
differences between good and poor readers in automaticity of word access.

The only thing problematical about the "rapid learning from context"
theory is that experimental studies generally have seemed to show that
children do not learn word meanings very well from context. For instance,
Jenkins, Pany and Schreck (1978) found that exposure to words in context
produced little increase in knowledge of their meanings, and no measurable
increase in the comprehension of text containing those words. Two factors
may account for this finding. First, there is reason to doubt whether the
contexts used in this experiment were really suitable for learning the
meanings of the new words. Second, as Jenkins, Pany, and Schreck suggest,
it may be that readers can encounter a substantial number of unfamiliar
words in a text and still comprehend it fairly well, especially if they
have some acquaintance with the general subject matter. Whatever the
explanation, the failure to find experimental evidence for contextual
learning of word meanings ought to be regarded as a conundrum for
experimentalists rather than the basis for educational policy.

APPENDIX A



Categories of Relationships Among Words 

Morphologically Basic Words 

This category includes any words which cannot be described as related
to some more basic word via some productive or semi-productive word
formation process. First of all, this means any monomorphemic words, e.g.,
add, foil, or wind. It also includes words that might be considered
multimorphemic in a historical sense, but which do not seem analysable in
terms of the word-formation processes of modern English.

Operationally, this category is also the "none of the above" category, 
that is, the classification of words which do not fall into the other
relationship categories in our coding system. However, if we have bent 
criteria, it has normally been in the direction of coding an item in some 
other relationship category. For example, the category of "idiosyncratic 
morphological relationships" was used to categorize relationships (e.g., 
between knowledge and know ) which would not be considered productive word 
formation processes of modern English. 

This category also includes those items which are morphologically 
basic as far as the American Heritage Intermediate Corpus is concerned. 
For example, the word imposters occurs in the corpus, but not the singular
imposter. Since no other words related to this item occur in the corpus
either, it was coded in the category "morphologically basic with respect to
this corpus." Items in this category were included with the category
"morphologically basic words" for the purpose of counting types of
relatedness, although they are also distinguished from the truly basic
words by a special flag.

Simple Capitalization 

This category includes all items in the corpus which differ from some
other existing item only with respect to capitalization. For example,
Teacher differs from teacher only in the capitalization of the initial
letter. This category is called simple capitalization in that it does not
include cases of capitalization homographic with a proper name, e.g., Jets
or Earl. Such items are included in the category "Capitalizations
homographic with proper names," discussed below.

Alternate Spellings 

This category includes those items which differ from some other item 
only with respect to spelling. For example, cart-horse is treated as a
spelling variant of carthorse. In many cases, this category was used for
misspellings which occurred in the corpus.

Alternate Pronunciations

This category was used for items spelled in nonstandard ways to
indicate pronunciation, for example, fishin', or crrrack.

Alternate Form of Word 

This category was used for alternate forms of words such as soya and
soy, hurray and hurrah, or britches and breeches, where the difference in
spelling reflects a difference in pronunciation, but one which involves
the phonemic form of the word. In other words, this category covers minor
differences in lexical form, whereas the category "Alternate
Pronunciations" covers differences which might be thought of as resulting
from low-level phonetic rules.

Alternate Forms with S

This category is a special case of the previous one. It includes
those minor variations in lexical form which consist of the presence or
absence of final s, as in toward and towards or amidship and amidships.
For lack of a better category, the pair amid and amidst is also categorized
here.

Regular Inflections 

This category includes all items related to their immediate ancestors
by regular inflection, that is, items which differed from other items only
by the endings s (es), ed, ing, and 's. Since the WFB provides no
context, it was not possible to distinguish between contractions (John's =
John is) and possessives. Therefore, in cases where a form ending in 's could
be interpreted as a possessive, it was included among the regular
inflections.

In the coding system there was a distinction made between regular
inflections (i.e., plurals, possessives, past tenses or past participles,
and third person singulars of verbs) and instances where ed or ing results
in words with distinct syntactic and perhaps also semantic properties, as
in the case of spelling, planking, crowded, and elevated. This
distinction, however, was often difficult to make. There are some cases,
such as dress/dressing, where there are substantial semantic shifts between
the two words; about 20 such items were found among the words coded. In
other cases, the semantic differences are a little less pronounced, as in
the case of spell/spelling. The semantic aspect of the coding system will
have captured the important differences between these types of
relationships. For the purpose of the overall counting, it was decided to
lump together all regular inflections, including items such as spelling or
dressing. The semantic codes can be used to distinguish such cases when
necessary.

The following categories were coded as distinct from regular
inflections:

a) Semantically irregular plurals such as top/tops, air/airs, and
premise/premises.

b) "Scientific" plurals such as genetics and genitals.

c) Incorrect regular inflections such as knowed.

d) Alternate forms of words with s, such as skyward/skywards.

Only 21 of the 7260 items coded fell into these last four categories.

Irregular Inflections

This category includes irregular plurals of nouns (mouse/mice),
irregular past tenses and participles of verbs (tear/tore/torn), some Latin
plurals (larva/larvae), and also suppletive forms such as I, me, mine.
Also included in this category are suppletive forms of the verb to be, for
example, is, are, was, were, been. Included as well in this category were
relationships such as our/ours, and my/mine.

As with regular inflections, there was a separate coding category for
irregular inflections that resulted in distinct words with different
syntactic (and sometimes semantic) properties. For example, known
functions as an adjective (a known criminal), as well as a past participle
(he should have known the answer). As in the case of regular inflections,
this distinction was sometimes difficult to make, and was not incorporated
into the counts presented here; both types of irregular inflections were
lumped together. Cases where there is a distinct semantic difference
between the two syntactic uses of the word can be identified in terms of
the semantic coding distinctions to be discussed below.

Regular Comparatives and Superlatives 

This category includes forms such as faster, slower, quickest, and
highest.

Irregular Comparatives and Superlatives

This category includes forms such as better, best, and worst.

Suffixation

Target items related to their immediate ancestor by suffixation were
divided into four categories. The first is what could be called "normal
suffixation." This is best defined in terms of the three remaining
categories which can be distinguished from it. The second category might be
called "suffix replacement." This category is used for those cases in which
the target word has a different suffix than its immediate ancestor. This
will necessarily be the case when the stem does not occur in English
without an affix. For example, the immediate ancestor of aggressive is
aggression (cf. Aronoff, 1976). Similarly, the immediate ancestor of
enthusiastic is enthusiasm. The same holds for pairs such as
chloride/chlorine, or stenographer/stenography. It was also decided to
treat pairs such as fragrance/fragrant and omnipotence/omnipotent in this
fashion.

A third subcategory of suffixation includes those cases where the
addition of a suffix is accompanied by unpredictable changes in the form of
the stem: for example implication/imply, apathetic/apathy,
negligent/neglect, or sensuous/sense. A fourth subcategory of suffixation
was used for those cases in which it seemed proper to analyse a word into a
stem + suffix, even when the stem itself was not an English word. For
example, nomin + al, cruci + fy. Only three cases of the 7260 items coded
were put into this category.

Prefixation

Target items related to their immediate ancestors by prefixation were
similarly divided into four categories. Examples of "prefix replacement"
are pairs such as decrease/increase and descend/ascend. Cases where
prefixation involved unpredictable changes in the form of the stem included
impoverish/poverty and mishap/happen.

No cases were analysed as prefix + bound stem. This would be done
only where there was some justification for assigning some specific
semantic content to the stem; this cannot be done in cases such as deceive,
perceive, or receive (cf. Shepherd, 1974).

Compounds

Compounds were coded into seven subcategories.

First, there are regular compounds, those which do not fall into any
of the following special categories. Second are hyphenated compounds which
do not meet criteria for any of the following special categories. The
difference between these first two categories is simply spelling. It is
not clear whether hyphens are used in compounds with any regularity or
consistency, but it seemed best to code the two types as distinct, since
the categories can always be collapsed afterwards. We have not made any
use of the distinctions among compound types in the analyses presented
here.

Third are hyphenated compounds with the internal structure of phrases
or sentences, for example: doctor-to-be, fission-fusion-fission,
twenty-year-old, or live-and-let-live.

A fourth category of compounds are contractions, such as can't,
daddy'll, nobody'd, and would've.

A fifth category of compounds was used for cases where the component
parts of the compound were not free stems in English, but could be assigned
a specific semantic value; for example omnipresent, cartography, theology,
or automobile.

A sixth category of compounds was used for those involving an
adverbial particle: wind-up, burnout, hookup, and tie-in. A final
category was used for compounds such as cranberry or chamberlain where one
element was clearly a meaningful unit in English, but the other was not a
word in English, nor could it easily be assigned any specific semantic
value.

Truncations

This category was used for the relationship between such pairs as
rhinoceros/rhino, raccoon/coon, and gentleman/gent. These cases were
distinguished from abbreviations, such as Mich for Michigan.

Idiosyncratic Morphological Relationships

This category was used for items which seemed to show a definite
morphological relationship with some immediate ancestor, yet which did not
seem to belong in the other categories. Often, this involved a difference
in form that could be thought of as a suffix, but was not productive at all
in English. For example, there were pairs such as: largesse/large,
prophesy/prophecy, musicale/musical, planetarium/planet, or knowledge/know.

Ambiguities 

The WFB was collected by computer, with "word" being defined as a
string of characters bounded right and left by spaces. This definition
treats as distinct words any graphically distinct types, no matter how
trivial the difference. It also lumps together any graphically identical
types, no matter how semantically diverse: all the different meanings of
bat, or mean, or bear. It would not be possible, and hence it was not our
intention, to disambiguate the items in this corpus. We have dealt with
one specific type of ambiguity, however: what could be called morphological
ambiguity, or ambiguity of relationship category. That is, we have tried
to represent ambiguity when it involved a word being analysable into two or
more of our categories of relatedness. A word such as bat, for example,
however many meanings it may have, falls into only one category of
relationship; it is a morphologically basic word. The word bats,
similarly, may have a number of meanings, but its relationship type is
unambiguous: it is a regular inflection of bat. The word felt, on the
other hand, is ambiguous in terms of its morphological relationships. On
one hand, it is an irregular past tense of the verb feel (which may of
course have any number of meanings). On the other hand, it is a
morphologically basic word as well.
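
The effect of this space-delimited definition of "word" can be illustrated with a minimal Python sketch. The sample sentence and the function name are hypothetical, and the handling of punctuation in the actual WFB is not modelled here.

```python
from collections import Counter


def graphic_types(text: str) -> Counter:
    """Count graphically distinct types, splitting on whitespace only.

    No case folding and no disambiguation is applied, mirroring the
    definition of "word" as a string of characters bounded by spaces.
    """
    return Counter(text.split())


# Hypothetical sample: "Teacher"/"teacher" differ only in capitalization,
# and "felt" occurs once as a verb form and once as a noun.
sample = "Teacher said the teacher felt the felt hat"
types = graphic_types(sample)

print(len(sample.split()), "tokens,", len(types), "types")
print(types["felt"])
```

In this sketch Teacher and teacher come out as two separate types, while the two uses of felt collapse into a single type with a count of two, which is the kind of case the morphological disambiguation described here is meant to handle.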

A word such as felt was coded as being related to two (or more, when
necessary) items, felt1 and felt2. These latter items, by definition
unambiguous with respect to their morphological relationships, were then
further analysed as any other items in the list would be.

APPENDIX B

Target Word - Immediate Ancestor Pairs
Illustrating SEM 0

TARGET WORD           IMMEDIATE ANCESTOR

senselessly           senseless
sensibly              sensible
chlorination          chlorinate
cleverly              clever
cleverness            clever
daintiness            dainty
decentralization      decentralize
desecration           desecrate
desegregation         desegregate

Target Word - Immediate Ancestor Pairs
Illustrating SEM 1

TARGET WORD           IMMEDIATE ANCESTOR

elfin                 elf
geneticist            genetic
misrepresent          represent
fragmentary           fragment
litigant
sunbonnet
enthusiast
washcloth
collectively
anywhere
crowded
various
lower-class
wily
wind-twisted
yummy
Botanic

Target Word - Immediate Ancestor Pairs
Illustrating SEM 2

TARGET WORD           IMMEDIATE ANCESTOR

therapeutic           therapy
[illegible]           [illegible]
gunner                gun
[illegible]           [illegible]
uncountables          uncountable
[illegible]           [illegible]
mainly                main
additional            addition
knowledge             know
once                  one
everyday              every
sky-high              sky
space-sick            space
stringy               string
sunsuit               sun
sunburn               sun
theorist              theory
Target Word - Immediate Ancestor Pairs
Illustrating SEM 3

TARGET WORD           IMMEDIATE ANCESTOR

password              pass
handspring            hand
collarbone            collar
airfoil               air
bloodshot             blood
sensor                sense
skydiver              sky
tweeter               tweet
visualize             visual
washroom              wash
apeak                 peak
Sunday-school         Sunday
hookworm              hook
inlay                 lay
mishap                happen
moonship              moon
noblesse              noble
ominous               omen
passenger-miles       passenger
pasteurize            Pasteur
percentile            percent
planetarium           planet
broadax               broad
chloride              chlorine
collinear             linear
conclusive            conclusion
[illegible]           doctor
doctrinaire           doctrine
elevator              elevate
fishwheel             fish

Target Word - Immediate Ancestor Pairs
Illustrating SEM 4

TARGET WORD           IMMEDIATE ANCESTOR

crowbait              crow
saucepan              sauce
fender                fend
vitality              vital
high-school           high
sauce[illegible]      sauce
artificial            artifice
apartment             apart
colleague             league
condescend            descend
go-getter             go
impregnable           impregnate
impressionable        impression
moonstruck            moon
negligible            neglect

Target Word - Immediate Ancestor Pairs
Illustrating SEM 5

TARGET WORD           IMMEDIATE ANCESTOR

dog-days              dog
Burma-Shave           Burma
prefix                fix
peppermint            pepper
shiftless             shift
misgive               give
poochie-pies          pies
crowbar               crow
foxtrot               fox
livelong              live

APPENDIX C

Types of Words in the Corpus

One issue in determining vocabulary size is deciding what types of
words to count, i.e., whether to include proper names, abbreviations,
numbers, and so on. We used the following set of categories to classify
the items in the WFB:

Proper Names 

This category was used primarily for names of specific individuals
(whether historical or fictional), and for names of geographic places.
Words directly derived from such proper names (e.g., American, Burmese,
British-controlled) were also included. Coded in this category as well
were days of the week, months, and names of companies and organizations (as
well as abbreviations of such names, e.g., AMF, AKC). Capitalization was
taken as evidence, but was not used as a criterial factor.

Items Homographic with Proper Names

In many cases, a capitalized word could be taken either as a proper
name (or part of a proper name), or else a common noun capitalized for some
other reason: e.g., Dodge, [illegible], [illegible], Dipper, Campfire, Earl,
Hood, Jets. (Because of the way the WFB was collected and keypunched, many
common nouns occur both in capitalized and uncapitalized form.) The category
of items homographic with proper names was grouped together with the category
"Proper Names" for the purpose of the analyses reported here. This is
because they allow interpretation as proper names, and their uncapitalized
versions have been already included in determining the number of non-proper
names.



Numbers and Formulae

This category includes types such as AOG, MCVII, NXN, R5, 108?, and
85::.

Compounds or Derivatives Based on Numbers

This category includes types such as 32nd, 106-ton, 17th-century, and
82-degree.

Abbreviations

Only twelve items of the 7260 coded fell into this category: They
were fps, Md, NW, PX, Rw, RW, TD, Te, MD'S, Doetr, and Ave. Dictionaries
were used to distinguish abbreviations from formulae. The subject
categories in the WFB also helped determine the proper interpretation of
some items; for example, if the type AOG occurred only in Mathematics, it
would seem to be best interpreted as a formula (probably the name of an
angle), rather than as the name of some organization.

Foreign Words

This category included words recognizable as belonging to languages
other than English, which were not found in the reference dictionaries used;
for example: ponere, daeghwamlican, Romani, les, las, Irae, decem, and
noire.

Nonwords

In this category were listed items which were not found in the
reference dictionaries used (including Webster's Third New International,
unabridged), and which could not be assigned to any of the other coding
categories discussed here. Some of the items found in this category are
clearly onomatopoetic: putt-putt-putt or wh-i-s-s-s-t. Others may be
deliberate coinages, such as yugit, cltcket, or pickle. Still others may
be noncapitalized versions of unfamiliar proper names (maribou, faeger), or
misspellings of other words. The total number of items in this category
(14[illegible]) is small enough so that reclassification of some of them would not
have much effect on the overall distribution of types in our analyses.

WFB Errors

A final category was used for 6 items which were erroneously repeated
in both the book and tape versions of the WFB.

\ 
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^' Footnotes 

The research reported herein was supported by the National Institute , 
of Education under Contract Nci* US-NIE-C--400-76-0116. ' 

1 ' > . . , 

It should be noted that the addition of such items to the list does
increase the overall size of the list, but does not inflate the number of
items in any given category. To illustrate this, consider a hypothetical
list consisting only of the words abatement, abates, abated, and after. As
it stands, the total length of the list is four items; in terms of
relationship categories, there would be one instance of suffixation, two
instances of regular inflection, and one basic word. Our goal, however, is
to define the count so as to have it reflect the number of word families in
a corpus, for any given definition of word family that can be constructed
in terms of our coding system. For example, assume that we want to know
the number of distinct word families in this hypothetical corpus for a
child who understands regular inflections, but who has not yet internalized
any rules of suffixation. For such a child, there would be three distinct
word families in this corpus: one containing after, one containing abates
and abated, and one containing abatement. (We had assumed that the child
at this point did not recognize the connection between abatement and
abate.) If we add the missing ancestor abate to the list, then to arrive
at the number of distinct word families we simply take the number of basic
words, plus the number of items in any relationship type not yet mastered
by the child at the level of linguistic development in question. In this
case, the corpus would contain abate (the missing ancestor of abates and
abated), abates, abated, abatement, and after: that is, two basic words,
two regular inflections, and one instance of suffixation. If we want to
know how many word families are in the corpus for a child who has
internalized the rules of regular inflection, but not those of suffixation,
we arrive at the count of three. For a child who has also mastered
suffixation, there are only two distinct word families in this corpus.
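
The counting rule in this footnote can be sketched in a few lines of Python. This is only an illustration of the rule as stated above (basic words plus items in relationship types not yet mastered); the function name, the category labels, and the data layout are assumptions made for the example, not part of the original coding system.

```python
from collections import Counter

# Hypothetical coding of the footnote's toy corpus, with the missing
# ancestor "abate" already added: (word, relationship category).
CORPUS = [
    ("abate", "basic"),
    ("abates", "regular inflection"),
    ("abated", "regular inflection"),
    ("abatement", "suffixation"),
    ("after", "basic"),
]


def count_word_families(coded_items, mastered):
    """Basic words plus every item whose relationship type is not yet mastered."""
    tallies = Counter(category for _, category in coded_items)
    unmastered = sum(
        n for category, n in tallies.items()
        if category != "basic" and category not in mastered
    )
    return tallies["basic"] + unmastered


# A child who has mastered regular inflection only, then suffixation as well.
print(count_word_families(CORPUS, {"regular inflection"}))
print(count_word_families(CORPUS, {"regular inflection", "suffixation"}))
```

Under these assumptions the two calls give three and two families, matching the counts worked out in the footnote.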

Thus, the addition of "missing ancestors" to the list does increase
the overall number of items, but it does not distort the count of items in
any given relationship category. The same holds for items added to
disambiguate morphologically ambiguous target words. Consider a
hypothetical corpus consisting of the following items: feel, felt, go,
went, and after. We would want to say that there are four morphologically
basic words: feel, go, after, and the noun felt. We would also want to say
that the list contained two irregular inflections: went and felt. Thus, a
morphologically ambiguous word like felt should be counted in each of the
categories to which it belongs.

Thus, tabulations of the number of items in various relationship
categories will include added entries which are disambiguations and missing
ancestors, in determining the composition of the sample and the corpus.

There were also certain items added to the list during the coding
process which were not included in tabulation of relationship types. For
example, compounds were given a separate entry for each component part.
This was because the relationship between farmhand and farm, for example,
might be quite different from the relationship between farmhand and hand.
The first relationship is semantically transparent; the second involves a
secondary meaning of hand related to the more primary meaning by a metaphor
(metonymy might be the more accurate term in this case) which might not be
immediately transparent to an elementary school child. In any case, for
each compound, additional items were added to express the relationship of
the compound to each of its component parts. These added items were not,
however, counted in the tabulation of the number of items in any given
relationship category.

In the tabulation of compounds for different levels of semantic
transparency, the two codes for each compound were collapsed, and the
compound was assigned the degree of semantic transparency associated with
the least transparent of its members. This reflects the assumption that
the difficulty of learning a new compound such as farmhand is determined
largely by the difficulty of learning the least semantically transparent of
its component parts.
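
The collapsing rule for compounds can be written as a one-line function. The sketch below assumes that semantic transparency is coded as an integer from SEM 0 (most transparent) to SEM 5 (least transparent), as the SEM labels in the tables suggest; the function name and the codes given for farmhand are hypothetical and chosen only for illustration.

```python
def compound_transparency(sem_first: int, sem_second: int) -> int:
    """Return the SEM code of the least transparent member of a compound."""
    for code in (sem_first, sem_second):
        if not 0 <= code <= 5:
            raise ValueError(f"SEM code out of range: {code}")
    # A larger SEM code means a less transparent relationship, so the
    # least transparent member is simply the maximum of the two codes.
    return max(sem_first, sem_second)


# Hypothetical codes: farmhand/farm is transparent (0), farmhand/hand less so (2).
print(compound_transparency(0, 2))
```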

The values in our estimates for the population of words in printed
school English were calculated as follows: First, the items in our sample
were ordered by frequency, and divided into seven strata containing equal
numbers of items, each representing a band of frequencies. From Table B-8
in the WFB (p. xxxvi), the number of words in printed school English within
each frequency band was determined. A weighting factor was assigned to
each stratum representing the ratio of the number of words in the
population within that frequency band to the number of words in the
corresponding stratum in our sample.

The size of the WFB, even as large as it is, creates an artificial
"floor" for the reported frequencies. That is, any word, however low its
"true" probability or frequency, if it occurs in the corpus at all, will be
assigned a certain minimum frequency value. The U-values (estimated
frequency per million) of the 35,079 hapax legomena in the corpus were
adjusted according to the amount of text from the subject categories in
which they occurred. The result of this was that the second from the
lowest frequency stratum in our sample had an artificially small frequency
range (in terms of reported frequencies), and hence an unrealistically low
weighting factor in the initial estimate. This was corrected by plotting
the final weighting factors on a smooth, essentially exponential curve
determined by the value of the other weighting factors and by the
constraints on the value of the sum of all weighting factors.

The actual weighting factors had the following values, expressed in
terms of how many words in the population a single word in each stratum of
our 7260-word sample would represent.

STRATUM        FREQUENCY RANGE          WEIGHT
             LOWER U      UPPER U

   1           .0004        .0109       314.80
   2           .0109        .0150       121.17
   3           .0150        .0457        64.78
   4           .0457        .1176        3?.39
   5           .1176        .4071        23.39
   6           .4071       2.0430        13.80
   7          2.0430    7456.8281        11.72

The weights given are those relating our sample to the population; the
relationships between the WFB and the population could be represented by
dividing those weights by 11.9478.
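
The way stratum weights turn sample counts into population estimates can be sketched as follows. This is a minimal illustration, assuming that a category's population estimate is the weighted sum of its per-stratum sample counts; the function names and the three-stratum numbers in the usage lines are made up for the example and are not figures from the study. Only the constant 11.9478 comes from the text.

```python
WFB_TO_SAMPLE_RATIO = 11.9478  # words in the WFB per word in the sample (from the text)


def project_to_population(sample_counts, weights):
    """Weighted sum: each sampled word stands for `weight` population words."""
    if len(sample_counts) != len(weights):
        raise ValueError("need exactly one weight per stratum")
    return sum(count * weight for count, weight in zip(sample_counts, weights))


def wfb_to_population_weights(sample_weights):
    """Rescale sample-to-population weights so they relate the WFB to the population."""
    return [weight / WFB_TO_SAMPLE_RATIO for weight in sample_weights]


# Hypothetical example: three strata, lowest frequency first.
counts = [10, 20, 30]
weights = [300.0, 60.0, 12.0]
print(project_to_population(counts, weights))
print(wfb_to_population_weights(weights))
```

Because low-frequency strata carry the largest weights, categories concentrated among rare words grow much more in the projection than categories concentrated among common words.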

We also wanted to determine the extent to which the choice of
weighting factors influenced our final estimates of vocabulary size.
Therefore, we tried calculating estimates for the total population on the
basis of a number of sets of weighting factors: the original estimates, our
adjusted smooth exponential curve, and a number of exponential functions
which in effect defined the extreme values of functions that could be drawn
through the points determined from the tables in the WFB.

Our final weighting function gave us an estimate of 45,453
morphologically basic words in the population. The other sets of weighting
factors gave estimates ranging between 45,285 and 47,418 morphologically
basic words, a range of only 2,133. Thus, any reasonable variation in the
weighting factors would lead to only very small differences in the values
of our final estimates. Even for those categories more skewed in terms of
frequency than were the morphologically basic words, the estimates based on
the different sets of weighting factors were very close.

We also calculated estimates for the population by assigning weighting
factors to words individually on the basis of the function

$$W = \frac{11.9478}{1 - (1 - p)^{n}}$$

where 11.9478 is the number of words in the WFB divided by the number of
words in our sample, and p is the probability of a word, that is,
U/1,000,000. The expression $1 - (1 - p)^{n}$ is the likelihood of a word
with probability p occurring in a corpus of n running words; hence it is
also the proportion of words with probability p that should occur at least
once in a corpus of n words. This formula gave us essentially the same
results as our earlier calculations.
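
The per-word weighting function can be evaluated directly. The sketch below is an illustration only: the function name is invented, and the corpus size of 5,000,000 running words in the usage lines is a round figure assumed for the example, since the value of n is not given in this passage. The quantity 1 - (1 - p)^n is computed with `math.log1p` and `math.expm1` so that very small probabilities do not lose precision.

```python
import math

WFB_TO_SAMPLE_RATIO = 11.9478  # from the text


def per_word_weight(u_per_million: float, corpus_tokens: int) -> float:
    """W = 11.9478 / (1 - (1 - p)^n), with p = U / 1,000,000."""
    p = u_per_million / 1_000_000
    if not 0.0 < p < 1.0:
        raise ValueError("U must give a probability strictly between 0 and 1")
    # 1 - (1 - p)^n rewritten as -expm1(n * log1p(-p)) for numerical stability.
    prob_at_least_once = -math.expm1(corpus_tokens * math.log1p(-p))
    return WFB_TO_SAMPLE_RATIO / prob_at_least_once


N_TOKENS = 5_000_000  # assumed round figure, not taken from the text
for u in (0.0004, 0.01, 1.0, 1000.0):
    print(u, per_word_weight(u, N_TOKENS))
```

For frequent words the denominator approaches 1 and the weight approaches 11.9478; for rare words the denominator is close to n times p, so the weight grows roughly in inverse proportion to frequency.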

Note that items added to the original sample in the coding process,
missing ancestors and disambiguations, were not included in the process of
estimating the composition of the population. The procedures for
extrapolating from the sample to the population already account for words
that do not occur in the WFB, so to include items added to our sample in
these estimates would have amounted to counting them twice.

Morphologically ambiguous items were also not included in our
projections for the population, because there was no way to accurately
assign a frequency to the different analyses each ambiguous form allowed.
There was a relatively small number of morphologically ambiguous words in
our sample (19 altogether), and an estimated 292 in the entire vocabulary
of printed school English. Even if each of these were three ways ambiguous
(definitely an overestimate), this would add less than a thousand items to
the total population, and these would be scattered among various
categories. Inclusion or exclusion of these items in our estimates
therefore makes no meaningful difference in the size of the categories we
will be considering.

3

Main entries in Webster's Third meet the following criteria:

First, plurals and verb parts are included under the main entry of the
uninflected word, unless they would fall alphabetically more than five
inches away from the main entry, in which case they are listed as a
separate main entry in their appropriate alphabetical order. For example,
bows, although it is a regular plural of bow, is listed as a separate main
entry, because there are more than five inches of intervening words, e.g.,
bowie, bower, bowel. The same principle is followed for comparatives and
superlatives, as well as variants in spelling. This means that almost all
irregular plurals or verb forms, as well as many regular plurals and verb
forms, will be listed as separate main entries.

Homonyms are given separate main entries, distinguished by initial
superscript numbers. However, to facilitate comparison with Dupuy's
estimate of the number of main entries in Webster's Third, we will follow
Dupuy in not counting homonyms as separate main entries.

There are two forms of run-on entries. First, idioms and phrases
based on the main entry word are listed as run-on entries under that main
entry. These phrases and idioms are given separate definitions. Second,
certain derived forms are also listed under the main entry, namely, forms
derived by suffixes such as -ness or -ly. Not all such derived forms are
thus included under the main entry. For example, quickly is listed as a
main entry separately from quick. The following criteria are used for
including a derived form under the main entry as a run-on entry: First, the
derivatives have to occur in alphabetical order. This presumably means
that a derivative which would be separated from its main entry by
intervening words would have to be listed as a separate main entry.
Secondly, such derivatives are given without definition, presumably because
their meaning is totally predictable from the meanings of the base and the
affix. Therefore, any derivative whose meaning was not thus totally
predictable would be listed as a separate entry. This summarizes the
principles according to which types are grouped into main entries or split
into distinct entries.

As to the types of items included in the dictionary: First of all,
only certain types of proper names are included. Names of persons and
geographical place names are not listed in the dictionary. However, some
other types of names are listed, for example, names of tribes and peoples,
and words derived from names of persons or places. For example, the word
wichita is included as a name of the Amerindian people, and as an
adjective based on the city name, but the city name itself is not included
as an item in the dictionary. The proper name Tito is not found in the
dictionary, but the noun Titoism is. The name Tiv (a people in Africa) is
included, as well as adjectives such as Wickliffian.

Arabic numerals are not included, with the following exception:
certain compounds, for example, 2-D, are included, but alphabetized as if
they were spelled out. Compounds such as ninety-one, ninety-two, and
ninety-three are also included.

Symbols, combining forms (e.g., pseudo-) and symbols (as for elements)
are also included as dictionary items.

Compounds are also given as separate main entries. This includes
compounds which are written as two separate words, e.g., luna moth or heat
exhaustion.



In principle, Webster's Third includes compounds containing numbers,
alphabetized as if they were spelled out. In practice, there are very few
such items in this dictionary, one example being [illegible]. None of the
items in our sample coded as "compounds containing numbers" would have been
listed as entries in Webster's Third, so this entire category was excluded.

In the category of nonwords, [?]7 items in our sample were prefixes and
suffixes that would be listed in Webster's Third. However, Dupuy's (1974)
calculation of the number of main entries in Webster's Third, which we will
be making use of, excludes such entries, so we will also exclude these from
our estimate.

Only a very small fraction of the alternate spellings in our sample
would have appeared as separate entries in Webster's Third. Most of them
are either deliberate or accidental misspellings, or words spelled in some
unusual way, for example with hyphens to show syllabification. The small
percentage of items in the category of alternate spellings that would
constitute separate dictionary entries was taken into account in our
estimate of "Webster main entry equivalents."

Although Webster's Third does contain some words that might be
considered "foreign," one criterion for coding an item in our sample as
"foreign" was that it not be listed in Webster's Third. Therefore all items
in this category are excluded from our estimate.

Regular inflections with distinct meanings, e.g., experienced,
collected, heaping, conditioning, tried, are given separate entries in
Webster's Third. Such items were therefore included in our count of
"Webster main entry equivalents."

There is some reason to believe that at least at the higher end of
the scale, scores on Dupuy's test may underestimate an individual's true
vocabulary size by less than a factor of seven. The single largest factor
contributing to the difference between Dupuy's estimate of the number of
words in English and ours was his exclusion of words that did not occur as
main entries in all of the four large dictionaries he used. Presumably the
words that were excluded on this principle would on the average be harder
or less likely to be known than words which did appear as main entries in
all four dictionaries. Therefore, Dupuy's sample of words would contain a
higher proportion of easier words than would be drawn from a complete range
of 88,500 word families.

On the other hand, as already mentioned, it is our estimate of the
number of distinct word families that is about seven times greater than
Dupuy's estimate of the number of Basic Words in English. If one takes the
position that distinct meanings should be counted as separate words,
Dupuy's test underestimates the size of an individual's vocabulary to an
even greater degree.

Beck, Perfetti, and McKeown (in press) matched children from
different intact classes on the basis of pretest scores. Some of the
control subjects were drawn from a combined third and fourth grade class.
This class may have had lower reading attainment than the other classes.
It is well known that matching does not eliminate preexperimental
differences when the populations sampled are different (cf. Campbell &
Boruch, 1975).

Anderson and Freebody (in press) have shown that good readers in the
middle grades aggressively apply morphological principles to hypothecate
meanings for unfamiliar words.

\ 



■ \ 




Table 1

A "Word Family" Found in Our Sample
(in alphabetical order)

add
ADD
add-on
added
addend
addends
adding
Adding
addition
Addition
ADDITION
addition-subtraction
additional
additions
additive
additive-inverse
additives
Additives
adds
Table 2

Relationships Among Members of a Word Family
In Terms of Target Words and "Immediate Ancestors"

Target Word            Immediate Ancestor   Affix    Relationship

add                                                  morphologically basic word
ADD                    add                           capitalization
add-on                 add                           compound (first member)
add-on                 on                            compound (second member)
added                  add                           regular inflection
addend                 add                           suffixation
addends                addend                        regular inflection
adding                 add                           regular inflection
Adding                 adding                        capitalization
addition               add                  -ition   suffixation
Addition               addition                      capitalization
ADDITION               addition                      capitalization
addition-subtraction   addition                      compound (first member)
addition-subtraction   subtraction                   compound (second member)
additional             addition             -al      suffixation
additions              addition                      regular inflection
additive               addition             -ive     suffix replacement
additive-inverse       additive                      compound (first member)
additive-inverse       inverse                       compound (second member)
additives              additive                      regular inflection
Additives              additives                     capitalization
adds                   add                           regular inflection
Table 3

Categories of Relationships Among Words

                                                    Examples
Category                                  Target Word    Immediate Ancestor

Morphologically basic word                [illegible]
Simple capitalization                     [illegible]    [illegible]
Alternate spellings                       cart-horse     carthorse
Alternate pronunciations                  fishin'        fishing
Alternate form of word                    soya           soy
Alternate form with s                     towards        toward
Regular inflections                       walks          walk
Irregular inflections                     went           go
Regular comparatives & superlatives       taller         tall
Irregular comparatives & superlatives     best           good
Suffixation                               frustration    frustrate
Prefixation                               unknown        known
Compounds and contractions                farmhand       farm, hand
                                          can't          can, not
Truncations                               rhino          rhinoceros
Idiosyncratic morphological relationships prophesy       prophecy
Table 4

Analysis of the Word Frequency Book by Word-Relatedness Categories

                                      Sample   Sample   Corpus   Population   Population
Category                                 N        %        N          %            N

A. Categories that would be included in most definitions of "word."

Morphologically basic                   846     11.65   10,108      7.46        45,453
Idiosyncratic relation                   72      1.00      860      1.01         6,167
Suffixation                             722      9.94    8,626      7.62        46,431
Prefixation                             233      3.21    2,784      4.01        24,457
Compounding & contractions            1,038     14.30   12,402     17.23       105,044
Truncations                              16      0.22      191      0.19         1,144
Abbreviations                            12      0.17      143      0.15           897
Subtotal                              2,939     40.48   35,115     37.66       229,593

B. Categories that would have their own separate entries in most dictionaries

Irregular inflections                    49      0.67      585      0.25         1,528
Irregular comparative & superlative       1      0.01       12      0.002           13
Alternate forms of words                  8      0.11       96      0.18         1,072
Alternate forms with s                    8      0.11       96      0.11           693
Semantically irregular pl.                8      0.11       96      0.02           136
"Scientific plurals"                      2      0.03       24      0.02           145
Subtotal                                 76      1.05      907      0.59         3,587

C. Categories that would not normally occur as separate dictionary entries.

Regular inflections                   1,553     21.39   18,555     16.37        99,547
Regular comparative & superlative        46      0.63      550      0.51         3,149
Incorrect regular infl.                   3      0.04       36      0.07           450
Simple capitalization                   618      8.51    7,384      8.51        51,906
Alternate spellings                     136      1.87    1,625      3.05        18,584
Alternate pronunciations                 87      1.20    1,039      1.21         7,381
Subtotal                              2,443     33.65   29,188     29.69       181,017

D. Categories relating to proper names

Basic proper names                      929     12.80   11,099     14.78        90,107
Derived proper names                     88      1.21    1,051      1.18         7,215
Capitalizations homographic
  with p.n.'s                            76      1.05      908      0.67         4,114
Inflectional and other
  variants of p.n.'s                    302      4.16    3,608      4.74        28,869
Subtotal                              1,395     19.21   16,667     21.38       130,305

E. Categories not normally counted as words

Formulae & numbers                      399      5.50    4,767      5.89        35,891
Compounds containing numbers             41      0.56      490      0.80         4,894
Nonwords                                147      2.02    1,756      3.35        20,444
Foreign words                            46      0.63      550      0.92         5,618
Subtotal                                633      8.80    7,563     10.97        66,847

F. Miscellaneous categories

Errors in WFB (duplicated entries)        6      0.08        6
Ambiguous words
  (excluding proper names)               19      0.26      227      0.05           292
Ambiguous proper names                    2      0.03       24      0.004           27
Missing ancestors added                 203      2.80    2,425
2nd meanings of ambiguous
  items added                            51      0.70      609
Table 5

Derived Words Arranged by Relationship Category
and Degree of Semantic Relationship

                          Relationship Categories

            Suffix    Prefix    Compound    Idiosyncratic     Total

SEM 0       26,840    12,999     21,773           519         62,131
SEM 1        6,289     4,051     28,591           666         39,597
SEM 2        6,904     3,476     26,033           879         37,292
SEM 3        3,717     2,630     17,817         2,435         26,599
SEM 4        1,413       636      4,675         1,162          7,886
SEM 5        1,269       666      6,155           505          8,595

SEM 0-2     40,033    20,526     76,397         2,064        139,020
SEM 3-5      6,399     3,932     28,647         4,102         43,080
Table 6

Derived Words Arranged by Relationship Category
and Degree of Semantic Relationship
(Minimal Semantic Distance Based on Most Similar Meanings)

                          Relationship Categories

            Suffix    Prefix    Compound    Idiosyncratic     Total

SEM 0       28,491    13,555     22,436           807         65,289
SEM 1        6,780     4,296     32,132           627         43,835
SEM 2        6,562     3,523     25,223         1,178         36,486
SEM 3        2,646     1,828     16,387         2,774         23,635
SEM 4          740       456      2,765           673          4,634
SEM 5           64        13      2,820            65          2,962

SEM 0-2     41,833    21,374     79,791         2,612        145,610
SEM 3-5      3,450     2,297     21,972         3,512         31,231

Table 7

Some Estimates of the Number of Words in English

                              Main        Basic       Basic      Total      Basic
                              Entries(a)  Words(b)    Words(c)   Words(c)   Words(d)

Author's original estimate    240,000     12,300      166,247    370,265    99,6?0
Estimated number in the WFB    37,707     16,655(e)    31,095     50,765    18,037
Estimated number in printed   243,136     88,533(e)   192,909    344,572    91,466
  school English

(a) Webster's Third (estimated by Dupuy, 1974)
(b) Dupuy (1974)
(c) Seashore & Eckerson (1944)
(d) Seashore & Eckerson (1944) (with revision by Lorge & Chall, 1963)
(e) Morphologically basic words plus semantically opaque (SEM 3, 4, 5) derivatives

Table 8

Polysemy Among Morphologically Basic Words

                          Polysemy Measure

                  Mean Number of Meanings     Total Number of Distinct
                  Per Morphologically         Meanings Among Morphologically
Extent of         Basic Word                  Basic Words
Polysemy
                  WFB        Population       WFB        Population

SEM 0             4.218                       42,636
SEM 1             2.872                       29,030
SEM 2             2.038      1.615            20,600     73,417
SEM 3             1.417      1.316            14,323     59,821
SEM 4             1.231                       12,443
Homographs        1.103                       11,149
Phrasal and       0.436                        4,407
  idiomatic entries
Table 9

Count of Basic Words Incorporating Homophony

                                              Number of Words
                                              WFB       Population

"Semantically distinct" defined with
SEM 2 cut-off

  Number of distinct meanings of
  morphologically basic words                 20,600      73,417
  Number of distinct derived words             4,779      31,821
  Total                                       25,379     105,238

"Semantically distinct" defined with
SEM 3 cut-off

  Number of distinct meanings of
  morphologically basic words                 14,323      59,821
  Number of distinct derived words             1,039       7,596
  Total                                       15,362      67,417
Table 10

Cumulative Distribution of Words by Frequency

                Number of Words in Printed School English
                at or Above that Frequency

Frequency       Graphically    Morphologically Basic     Semantically
(in terms       Distinct       Words and Semantically    Transparent
of U)           Types          Opaque Derivatives        Derivatives

100.00                                  555                     55
31.623                                1,225                    175
10.000             5,480              2,450                    455
3.1623            11,980              4,330                  1,290
1.0000            24,108              6,700                  3,300
.31623            44,743             10,400                  7,150
.10000            76,757             15,350                 13,400
.03162           122,045             21,700                 23,000
.00132           304,803             46,300                 65,000
.00003           512,886             75,000                116,000
0.0000           609,606             88,500                139,000

Table 11

The Average Composition of a Word Family

   Number of Words
Definition A   Definition B   Type of Words

    1.00           1.00       Base word (a morphologically basic word or
                                semantically opaque derivative)

     .15                      SEM 4 derivatives
     .49                      SEM 3 derivatives
     .65                      Total semantically obscure derivatives (SEM 3, SEM 4)

     .69            .42       SEM 2 derivatives
     .73            .45       SEM 1 derivatives
    1.15            .70       SEM 0 derivatives
    2.57           1.57       Total semantically transparent derivatives (SEM 0-SEM 2)

     .04            .02       Truncations and abbreviations
     .07            .02       Irregular inflections, comparatives and superlatives;
                                alternate forms of words; semantically irregular plurals
    1.90           1.16       Regular inflections, comparatives and superlatives
    2.00           1.22       Total inflections, abbreviations and truncations

     .94            .58       Simple capitalizations
     .34            .21       Alternate spellings
     .14            .08       Alternate pronunciations
    1.42            .87       Total minor variations in form

    7.64           4.66       Total family size in graphically distinct types
Figure Caption

Figure 1. Graphic Representation of Relationships Among Words



